Abstract

A common goal of privacy research is to release synthetic data that satisfies a formal privacy guarantee
and can be used by an analyst in place of the original data. To achieve reasonable accuracy, a synthetic
data set must be tuned to support a specified set of queries accurately, sacrificing fidelity for other queries.

This work considers methods for producing synthetic data under differential privacy and investigates
what makes a set of queries "easy" or "hard" to answer. We consider answering sets of linear counting
queries using the matrix mechanism [15], a recent differentially-private mechanism that can reduce error
by adding complex correlated noise adapted to a specified workload.

Our main result is a novel lower bound on the minimum total error required to simultaneously release
answers to a set of workload queries. The bound reveals that the hardness of a query workload is related
to the spectral properties of the workload when it is represented in matrix form. The bound is tight
and, because it satisfies important soundness criteria, it can serve as a reliable numerical measure of the
"error complexity" of a workload.

1 Introduction

Differential privacy [8] is a rigorous privacy standard offering participants in a data set the appealing
guarantee that released query answers will be nearly indistinguishable whether or not their data is included. The
earliest methods for achieving differential privacy were interactive: an analyst submits a query to the server
and receives a noisy query answer. Further queries may be submitted, but increasing noise will be added
and the server may eventually refuse to answer subsequent queries.

To avoid some of the challenges of the interactive model, differential privacy has often been adapted to a
non-interactive setting, where a common goal has been to release a synthetic data set that the analyst can
use in place of the original data. There are a number of appealing benefits to releasing a private synthetic
database: the analyst need not carefully divide their task into individual queries, and can use familiar data
processing techniques on the synthetic data; the privacy budget will not be exhausted before the queries
of interest have been answered; and the analyst can carry out data analysis using their own resources and
without revealing their tasks to the data owner.

There are limits, however, to private synthetic data generation. When a synthetic dataset is released,
the server no longer controls how many questions the analyst computes from the data. Dinur and Nissim
showed that accurately answering "too many" queries of a certain type is incompatible with any reasonable
notion of privacy, allowing reconstruction of the database with high probability [6].

This tempers the hopes of private synthetic data to some degree, suggesting that if a synthetic dataset is 
to be private, then it can be accurate only for a specific class of queries, and may need to sacrifice accuracy 
for other queries. A number of methods have been proposed for releasing accurate synthetic data for specific 
sets of queries [5, 15, 13, 22, 23, 3, 1, 19]. These results show that it is still possible to achieve many of the 
benefits of synthetic data if the released data is targeted to a workload of queries that are of interest to the 
analyst.

In general, differentially private algorithms for answering sets of queries with minimum error are not
known. The goal of our work is to develop tools that can explain what we informally term the error 
complexity of a given workload, which should measure, for fixed privacy parameters, the accuracy with 
which we can simultaneously answer all queries in the workload. 

Such tools can help us to answer a number of natural questions that arise in the context of private 
synthetic data generation. Why is it possible to answer one set of queries more accurately than another? 
What properties of the queries, or of their relationship to one another, influence this? Can lower error be 
achieved by specializing the query set more closely to the task at hand? Does the combination of multiple 
users' workloads severely impact the accuracy possible for the combined workload? 

Naive approaches to understanding the "hardness" of a query workload are unsatisfying. For example, 
one may naturally expect that the larger the workload, the greater the error in simultaneously answering it. 
Yet the number of queries in a workload is usually an inadequate measure of its hardness. Query workload 
sensitivity [8] is another natural approach. Sensitivity measures the maximum change in all query answers 
due to an insertion or deletion of a single database record. Basic differentially private mechanisms (e.g. the 
Laplace mechanism) add noise to each query in proportion to sensitivity, and in such cases sensitivity does 
in fact determine error rates. But better mechanisms can reduce error when answering workloads of queries, 
so that sensitivity alone fails to be a reliable measure. 

One of the few existing results to shed light on these questions uses tools of learning theory to show that, 
for a general class of queries, it is always possible to release a differentially private synthetic database whose 
error bounds grow with the Vapnik-Chervonenkis (VC) dimension [21] of the workload and the size of the 
database. But the VC dimension is a rather coarse measure, and we will see that workloads with equivalent 
VC dimension may have significantly different error properties. 

In this paper we seek a better understanding of workload error complexity by reasoning formally about 
the minimum error achievable for a workload. We pursue this goal in the context of a class of differentially
private algorithms: namely those that are instances of the matrix mechanism (so named because workloads 
are represented as matrices and analyzed algebraically). The matrix mechanism can be used to answer sets 
of linear counting queries, a general class of queries which includes all predicate counting queries, histogram 
queries, marginals, data cubes, and others. 

The matrix mechanism exploits the relationships between queries in the workload to construct a complex, 
correlated noise distribution that offers lower error than standard mechanisms. It encompasses a range of 
possible approaches to differentially private query answering because it must be instantiated with a set of 
queries, called the strategy, which it uses to derive accurate answers to the workload queries. This makes 
the mechanism quite general, since any strategy can be selected. It includes as a special case a number of 
techniques proposed recently for range-count queries [15, 22], for computing sets of low order marginals [1], 
and for sets of data cubes [5]. 

We use the optimal error achievable under the matrix mechanism as a proxy for workload error complexity. 
Unfortunately, it is computationally infeasible to compute the optimal strategy for a given workload, making 
the assessment of workload complexity a challenge. Nevertheless, we are able to resolve this challenge through 
the following contributions: 

• Our main result is a novel lower bound on the minimum total error required to simultaneously release 
answers to a set of workload queries. The bound reveals that the "hardness" of a query workload is 
related to the eigenvalues of the workload when it is represented in matrix form. 

• We show that the bound is tight and characterize a class of workloads for which our lower bound 
is attainable. As a consequence, it is possible to directly construct a minimum error mechanism for 
workloads in this class. We also analyze the cases for which our bound is not tight. 

• We provide theoretical tools to enable the efficient computation of our bound for realistic scenarios, 
show how the bound can aid users of a differentially private interface in the design of query workloads, 
and compare the bound with measures based on VC dimension. 

Section 2 reviews definitions and Section 3 presents the matrix mechanism and properties of workload 
error. In Section 4 we present the lower bound and its proof. Section 5 evaluates the tightness and looseness
of the bound. In Section 6 we show how the bound interacts with algebraic operations on workloads. Lastly,
in Section 7 we consider efficiency of computation, practical applications, and show an elegant formulation
of VC dimension in terms of matrix properties. To aid intuition, we include throughout the paper a series
of examples in which we compute our lower bound on workloads of interest and report concrete results. The
appendix includes proofs of results not included in the body of the paper.

2 Definitions & Background

In this section we describe our representation of query workloads as matrices, formally define differential 
privacy, and review linear algebra notation. 

2.1 Data model and linear queries 

The queries considered in this paper are all counting queries over a single relation. Let the database I be an
instance of a single-relation schema R(A), with attributes $A = \{A_1, A_2, \ldots, A_m\}$. The crossproduct of
the attribute domains, written $dom(A) = dom(A_1) \times \cdots \times dom(A_m)$, is the set
of all possible tuples that may occur in I.

In order to express our queries, we encode the instance I as a vector x consisting of cell counts, each
counting the number of tuples in I satisfying a distinct logical cell condition.

Definition 2.1 (Cell Conditions). A cell condition is a Boolean formula which evaluates to True or False
on any tuple in dom(A). A collection of cell conditions $\Phi = \phi_1, \phi_2, \ldots, \phi_n$ is an ordered list of pairwise
unsatisfiable cell conditions: each tuple in dom(A) will satisfy at most one $\phi_i$.

The data vector is formed from cell counts corresponding to a collection of cell conditions. 

Definition 2.2 (Data vector). Given instance I and a collection of cell conditions $\Phi = \phi_1, \phi_2, \ldots, \phi_n$,
the data vector x is the length-n column vector consisting of the non-negative integral counts
$x_i = |\{t \in I \mid \phi_i(t) \text{ is True}\}|$.

We may choose to fully represent instance I by defining the vector x with one cell for every element of
dom(A). Then x is a bit vector of size |dom(A)| with nonzero counts for each tuple present in I. (This is
also a vector representation of the full contingency table built from I.) Alternatively, it may be sufficient
to partially represent I by the cell counts in x, for example by focusing on a subset of the attributes of A
that are relevant to a particular workload of interest. Because workloads are finite, it is always sufficient to
consider a finite list of cell conditions, even if attribute domains are infinite.

Example 2.1. Consider the relational schema R = (name, gradyear, gender, gpa) describing students and
suppose we wish to form queries over gradyear and gender only, where dom(gender) = {M, F} and
dom(gradyear) = {2011, 2012, 2013, 2014}. Then we can define 8 cell conditions which result from all
combinations of gradyear and gender. These are enumerated in Table 1(a).

Given a data vector x, queries are expressed as linear combinations of the cell counts in x.

Definition 2.3 (Linear counting query). A linear counting query is a length-n row vector
$q = [q_1 \ldots q_n]$ with each $q_i \in \mathbb{R}$. The answer to a linear counting
query q on x is the vector product $qx = q_1x_1 + \cdots + q_nx_n$.

If q consists exclusively of coefficients in {0, 1}, then q is called a predicate counting query. In this case,
q counts the number of tuples in I that satisfy the union of the cell conditions corresponding to the nonzero
coefficients in q.

Table 1: For schema R = (name, gradyear, gender, gpa), (a) shows 8 cell conditions on attributes gradyear
and gender. The database vector x (not shown) will accordingly consist of 8 counts; (b) shows a sample
workload matrix W consisting of five queries, each described in (c).

(c) Counting queries defined by rows of W
$q_1$: all students;
$q_2$: students with gradyear $\in$ [2011, 2012];
$q_3$: female students with gradyear $\in$ [2011, 2012];
$q_4$: male students with gradyear $\in$ [2011, 2012];
$q_5$: difference between 2013 grads and 2014 grads.



2.2 Query workloads 

A workload is a finite set of linear queries. A workload is represented as a matrix, each row of which is a 
single linear counting query. 

Definition 2.4 (Query matrix). A query matrix is a collection of m unique linear counting queries, arranged
by rows to form an $m \times n$ matrix.

Note that cell condition $\phi_i$ defines the meaning of the $i$-th position of x, and accordingly, it determines
the meaning of the $i$-th column of W. Unless otherwise noted, we assume all workloads are defined over the
same fixed set of cell conditions.

Example 2.2. The matrix in Table 1(b) shows a workload of five queries. The first four are predicate 
queries. Table 1(c) describes the meaning of the queries w.r.t. the cell conditions in Table 1(a). 

We assume that workloads consist of unique queries, without duplicates. If workload W is an m x n 
query matrix, the answers for W are represented as a length m column vector of numerical query results, 
which can be computed by multiplying matrix W by the data vector x. 
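
As an illustration of this representation, the following minimal sketch builds a data vector and evaluates a five-query workload by a matrix-vector product. The cell ordering (year-major, M before F), the toy list `students`, and the matrix entries are assumptions made for this example; the rows are written to agree with the query descriptions in Table 1(c), not copied from Table 1(b).

```python
import numpy as np

# Assumed cell order: (2011,M), (2011,F), (2012,M), (2012,F), ..., (2014,F).
years = [2011, 2012, 2013, 2014]
genders = ["M", "F"]
cells = [(y, g) for y in years for g in genders]

# Hypothetical instance: (gradyear, gender) for six students.
students = [(2011, "M"), (2011, "F"), (2012, "F"),
            (2013, "M"), (2013, "M"), (2014, "F")]

# Data vector: one count per cell condition.
x = np.array([sum(1 for s in students if s == c) for c in cells])

# Workload rows written from the descriptions in Table 1(c).
W = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],      # q1: all students
    [1, 1, 1, 1, 0, 0, 0, 0],      # q2: gradyear in [2011, 2012]
    [0, 1, 0, 1, 0, 0, 0, 0],      # q3: female, gradyear in [2011, 2012]
    [1, 0, 1, 0, 0, 0, 0, 0],      # q4: male, gradyear in [2011, 2012]
    [0, 0, 0, 0, 1, 1, -1, -1],    # q5: 2013 grads minus 2014 grads
])

answers = W @ x   # true (noise-free) answers, one per query
print(answers)
```

With these assumed entries, the fourth row equals the second minus the third, so one query is a linear combination of two others even though all five are listed.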

Note that it is critical that the analyst include in the workload all queries of interest. In the absence 
of noise introduced by the privacy mechanism, it might be reasonable for the analyst to request answers 
to a small set of counting queries, from which other queries of interest could be computed. (E.g., it would 
be sufficient to recover x itself by choosing the workload defined by the identity matrix.) But because the 
analyst will receive private, noisy estimates to the workload queries, the error of queries computed from their 
combination is often increased. Our privacy mechanism is designed to optimize error across the entire set of 
desired queries, so all queries should be included. 

As a concrete example, in Table 1(b), $q_4$ can be computed as $(q_2 - q_3)$ but is nevertheless included in the
workload. This reflects the fact that we wish to simultaneously answer all included queries with minimum
aggregate error, treating each equally. It is also possible to scale individual rows by a positive scalar value,
which has the effect of reducing the error of that query.

The cell conditions are used to define the semantics of the queries in a workload, while the workload 
properties we study are primarily determined by features of their matrix representation. This can lead to a 
few representational inconsistencies we would like to avoid. First, while a workload W is meant to represent 
a set of queries, as a matrix it has a specified order of its rows. Further, for any workload W defined by cell
conditions $\Phi = \phi_1 \ldots \phi_n$, consider any permutation $\Phi'$ of $\Phi$. Then there is a different matrix W', defined
on $\Phi'$ and constructed from W by applying the permutation to its columns, that is semantically equivalent
to W. We will verify later that our analysis of workloads is representation independent for both rows and
columns. We use the following definition:

Definition 2.5 (Representation independence). A numerical measure $\rho$ on a workload matrix is row (resp.
column) representation independent if, given a workload matrix W, and any workload W' which results from
permuting the rows (columns) of W, $\rho(W) = \rho(W')$.

A related issue arises in the specification of the cell conditions. If a workload W is defined by cell
conditions in $\Phi$, then we can always consider extending $\Phi$ to $\Phi'$ by adding additional cell conditions not relevant
to the queries in W. We can then define a semantically equivalent workload W' on $\Phi'$, which will contain
columns of zeroes for each of the new cell conditions. To address this issue in later sections we rely on the
following definition:

Definition 2.6 (Column Projection). Given an $m \times n$ workload W defined by cell conditions $\Phi = \phi_1 \ldots \phi_n$,
and an ordered subset $\Phi' \subseteq \Phi$ consisting of p selected cell conditions, the column projection of W w.r.t. $\Phi'$ is
a new $m \times p$ workload consisting only of the columns of W included in $\Phi'$.

In the rest of the paper, $\mu$ denotes a subset of cell conditions, $|\mu|$ denotes its cardinality, and $U_n$ denotes
the set of all possible subsets of n cell conditions. $\mu(W)$ is the column projection of W w.r.t. $\mu$.

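Example 2.3. The column projection of the workload in Table 1(b) w.r.t. cell conditions $\{\phi_1, \phi_3, \phi_5, \phi_7\}$,
which consists of queries only over Male students, is the following:

$$\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{bmatrix}$$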

Common workloads. We introduce notation for two common workloads that contain all queries of a
certain type. AllRange(d) denotes the workload consisting of all range-count queries over d cell conditions
(typically derived from a single ordered attribute $A_i$ with $|dom(A_i)| = d$). There are $d(d+1)/2$ queries in
workload AllRange(d). Over k ordered attributes with domain sizes $d_1, \ldots, d_k$, we similarly define the set
of all k-dimensional range-count queries, denoted AllRange($d_1, \ldots, d_k$).

AllPredicate(d) is the much larger workload consisting of all predicate counting queries over d cell
conditions. There are $2^d$ queries in AllPredicate(d).

Example 2.4. The workload of queries counting the students who have graduated in any interval of years
drawn from {2011, 2012, 2013, 2014} is denoted AllRange(4) and consists of 10 one-dimensional range
queries over the gradyear attribute.

Example 2.5. The set of all two dimensional range-count queries over gradyear and gender is written
AllRange(4, 2), where the possible "ranges" for gender are simply M, F, or (M $\vee$ F). This workload
consists of 30 queries. The workload of all predicate queries over gradyear and gender is AllPredicate(8),
consisting of all 256 predicate queries over a domain of size 8.
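
The query counts in Examples 2.4 and 2.5 can be checked with a short sketch. The helper names `all_range` and `all_predicate` are introduced here for illustration, and the use of a Kronecker product to form two-dimensional ranges assumes the cells are ordered year-major; neither is taken from the paper.

```python
import numpy as np
from itertools import product

def all_range(d):
    """All one-dimensional range queries [i, j] over d ordered cells."""
    rows = []
    for i in range(d):
        for j in range(i, d):
            q = np.zeros(d, dtype=int)
            q[i:j + 1] = 1
            rows.append(q)
    return np.array(rows)

def all_predicate(d):
    """All 0/1 row vectors of length d (the all-zero query included)."""
    return np.array(list(product([0, 1], repeat=d)))

R4 = all_range(4)        # ranges over gradyear
R2 = all_range(2)        # "ranges" over gender: M, (M or F), F
R42 = np.kron(R4, R2)    # each row pairs one gradyear range with one gender range
P8 = all_predicate(8)

print(R4.shape, R42.shape, P8.shape)   # (10, 4) (30, 8) (256, 8)
```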

2.3 Differential privacy and basic mechanisms 

Standard $\epsilon$-differential privacy [8] places a bound (controlled by $\epsilon$) on the difference in the probability of
query answers for any two neighboring databases. For database instance I, we denote by nbrs(I) the set of
databases formed by adding or removing exactly one tuple from I. Approximate differential privacy [7, 17]
is a relaxation in which the $\epsilon$ bound on query answer probabilities may be violated with small probability,
controlled by $\delta$.

Definition 2.7 (Differential Privacy). A randomized algorithm $\mathcal{K}$ is $(\epsilon, \delta)$-differentially private if for any
instance I, any $I' \in nbrs(I)$, and any subset of outputs $S \subseteq Range(\mathcal{K})$, the following holds:

$$\Pr[\mathcal{K}(I) \in S] \le \exp(\epsilon) \times \Pr[\mathcal{K}(I') \in S] + \delta$$

where the probability is taken over the randomness of $\mathcal{K}$.

When $\delta = 0$, this definition describes standard $\epsilon$-differential privacy.

Both definitions can be satisfied by adding random noise to query answers. The magnitude of the required
noise is determined by the sensitivity of a set of queries: the maximum change in a vector of query answers
over any two neighboring databases. However, the two privacy definitions differ in the measurement of
sensitivity and in their noise distributions. Standard differential privacy can be achieved by adding Laplace
noise calibrated to the $L_1$ sensitivity of the queries [8]. Approximate differential privacy can be achieved
by adding Gaussian noise calibrated to the $L_2$ sensitivity of the queries [7, 17]. This small difference in
the sensitivity metric, from $L_1$ to $L_2$, has important consequences for the theory underlying our analysis
and, unless otherwise noted, stated results apply only to approximate differential privacy. Sec. 4.3 contains a
comparison of these two definitions as they pertain to the matrix mechanism and the results of this paper.

Since our query workloads are represented as matrices, we express the sensitivity of a workload as a
matrix norm. Notice that for neighboring databases I and I', $|(I - I') \cup (I' - I)| = 1$, and recall that all
cell conditions are mutually unsatisfiable. It follows that the corresponding data vectors x and x' differ in
exactly one component, by exactly one. We extend our notation and write $x' \in nbrs(x)$. The $L_2$ sensitivity
of W is equal to the maximum $L_2$ norm of the columns of W. Below, cols(W) is the set of column vectors
$W_i$ of W.

Definition 2.8 ($L_2$ Query matrix sensitivity). The $L_2$ sensitivity of a query matrix W is denoted $\Delta_W$ and
defined as follows:

$$\Delta_W \overset{\mathrm{def}}{=} \max_{x' \in nbrs(x)} \|Wx - Wx'\|_2 = \max_{W_i \in cols(W)} \|W_i\|_2$$

The classic differentially private mechanism adds independent noise calibrated to the sensitivity of a
query workload. We use $\mathrm{Normal}(\sigma)^m$ to denote a column vector consisting of m independent samples drawn
from a Gaussian distribution with mean 0 and scale $\sigma$.

Proposition 2.1. (Gaussian mechanism [7, 17]) Given an $m \times n$ query matrix W, the randomized
algorithm $\mathcal{G}$ that outputs the following vector is $(\epsilon, \delta)$-differentially private:

$$\mathcal{G}(W, x) = Wx + \mathrm{Normal}(\sigma)^m$$

where $\sigma = \Delta_W \sqrt{2\ln(2/\delta)}/\epsilon$.

Recall that Wx is a vector of the true answers to each query in W. The algorithm above adds independent
Gaussian noise (scaled by the sensitivity of W, $\epsilon$, and $\delta$) to each query answer. Thus $\mathcal{G}(W, x)$ is a length-m
column vector containing a noisy answer for each linear query in W.
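
The sensitivity of Definition 2.8 and the mechanism of Prop. 2.1 translate directly into a few lines of NumPy. This is a minimal sketch rather than a vetted implementation: the function names, the small workload, the cell counts, and the privacy parameters are all chosen here for illustration, and floating-point Gaussian sampling is used without any of the hardening a deployed mechanism would need.

```python
import numpy as np

def l2_sensitivity(W):
    """Maximum L2 norm over the columns of W (Definition 2.8)."""
    return np.linalg.norm(W, axis=0).max()

def gaussian_mechanism(W, x, eps, delta, rng):
    """W x plus independent Gaussian noise with the scale sigma of Prop. 2.1."""
    sigma = l2_sensitivity(W) * np.sqrt(2 * np.log(2 / delta)) / eps
    return W @ x + rng.normal(0.0, sigma, size=W.shape[0])

# Illustrative workload over four cells (not from the paper).
W = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, -1],
], dtype=float)
x = np.array([3, 0, 2, 5], dtype=float)    # hypothetical cell counts

rng = np.random.default_rng(0)             # fixed seed for repeatability only
noisy = gaussian_mechanism(W, x, eps=0.5, delta=1e-4, rng=rng)
print(l2_sensitivity(W), noisy)
```

Every column of this W has two nonzero entries of magnitude one, so its sensitivity is the square root of 2.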

2.4 Linear algebra notation 

Throughout the paper, we use the notation of linear algebra and employ standard techniques of matrix
analysis. Recall that for a matrix A, $A^T$ is its transpose, $A^{-1}$ is its inverse, and trace(A) is the sum of
values on the main diagonal. The Frobenius norm of A is denoted $\|A\|_F$ and defined as the square root of
the sum of the squared entries of A, or, equivalently, $\sqrt{\mathrm{trace}(A^TA)}$. We use $\mathrm{diag}(c_1, \ldots, c_n)$ to indicate an
$n \times n$ diagonal matrix with scalars $c_i$ on the diagonal. We use $0^{m \times n}$ to indicate a matrix of zeroes with m
rows and n columns. An orthogonal matrix Q is a square matrix whose rows and columns are orthogonal
unit vectors, and for which $Q^T = Q^{-1}$.

We will also rely on the notion of a positive semidefinite matrix. A symmetric square matrix A is called
positive semidefinite if for any vector x, $x^TAx \ge 0$. If A is positive semidefinite, denoted $A \succeq 0$, all its
diagonal entries are non-negative as well. In particular, for any matrix A, $A^TA$ is a positive semidefinite
matrix.

In addition, throughout the paper, we use $A^+$ to represent the Moore-Penrose pseudoinverse of a matrix
A, a generalization of the matrix inverse defined as follows:

Definition 2.9. (Moore-Penrose Pseudoinverse [2]) Given an $m \times n$ matrix A, a matrix $A^+$ is the
Moore-Penrose pseudoinverse of A if it satisfies each of the following:

$$AA^+A = A, \quad A^+AA^+ = A^+, \quad (AA^+)^T = AA^+, \quad (A^+A)^T = A^+A.$$

We include some important properties of the Moore-Penrose pseudoinverse in the following theorem.

Theorem 2.1. ([2]) The Moore-Penrose pseudoinverse satisfies the following properties:

1. Given any matrix A, there exists a unique matrix that is the Moore-Penrose pseudoinverse of A.

2. Given a vector y, we have $\|y - Ax\|_2 \ge \|y - AA^+y\|_2$ for any vector x.

3. For any satisfiable linear system BA = W, $WA^+$ is a solution to the linear system and
$\|WA^+\|_F \le \|B\|_F$ for any solution B to the linear system.
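
The conditions of Definition 2.9 and the third property of Theorem 2.1 can be checked numerically. The sketch below uses `numpy.linalg.pinv` on a randomly generated rank-deficient matrix; the matrix sizes, the seed, and the explicit `rcond` cutoff are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# A 5x4 matrix of rank 3, so A has no ordinary inverse or left inverse.
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))
A_pinv = np.linalg.pinv(A, rcond=1e-10)   # explicit cutoff for tiny singular values

# The four defining conditions of Definition 2.9.
penrose_ok = (
    np.allclose(A @ A_pinv @ A, A)
    and np.allclose(A_pinv @ A @ A_pinv, A_pinv)
    and np.allclose((A @ A_pinv).T, A @ A_pinv)
    and np.allclose((A_pinv @ A).T, A_pinv @ A)
)

# Property 3: W is built as B0 A, so the system B A = W is satisfiable.
B0 = rng.standard_normal((2, 5))
W = B0 @ A
B_min = W @ A_pinv                        # the solution singled out by the theorem
min_norm_ok = (
    np.allclose(B_min @ A, W)
    and np.linalg.norm(B0, "fro") + 1e-12 >= np.linalg.norm(B_min, "fro")
)

print(penrose_ok, min_norm_ok)
```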

3 The $(\epsilon, \delta)$-Matrix Mechanism

In this section we define the class of algorithms that can be constructed using the matrix mechanism, we 
define optimal error of a workload with respect to this class, and we develop notions of workload equivalence 
and containment consistent with our error measures. 

3.1 The extended matrix mechanism 

The matrix mechanism [15] has a form similar to the Gaussian mechanism in Prop. 2.1, but adds a more
complex noise vector. It uses a different set of queries (the strategy matrix A) to construct this vector.
The intuitive justification for this mechanism is that it is equivalent to the following three-step process: (1)
the queries in the strategy are submitted to the Gaussian mechanism; (2) an estimate $\hat{x}$ for x is derived by
computing the $\hat{x}$ that minimizes the squared sum of errors (this step consists of standard linear regression
and requires that A be full rank to ensure a unique solution); (3) noisy answers to the workload queries are
then computed as $W\hat{x}$.

We present an extended version of the matrix mechanism which relaxes the requirement that A be full
rank. Instead it is sufficient that all queries in W can be represented as linear combinations of queries in A.
Then estimating x is not necessary, and steps (2) and (3) can be combined.

Proposition 3.1 (Extended $(\epsilon, \delta)$-Matrix Mechanism). Given an $m \times n$ query matrix W, and a $p \times n$
strategy matrix A such that $WA^+A = W$, the following randomized algorithm $\mathcal{M}_A$ is $(\epsilon, \delta)$-differentially
private:

$$\mathcal{M}_A(W, x) = Wx + WA^+\,\mathrm{Normal}(\sigma)^p,$$

where $\sigma = \Delta_A \sqrt{2\ln(2/\delta)}/\epsilon$.

The proof is contained in App. A.1. Here the condition $WA^+A = W$ guarantees that the queries in A
can represent all queries in W. When A has full rank, $A^+ = (A^TA)^{-1}A^T$, which coincides with the original
definition [15].

Like the Gaussian mechanism, the matrix mechanism computes the true answer vector Wx and adds 
noise to each component. But a key difference is that the scale of the Gaussian noise is calibrated to the 
sensitivity of the strategy matrix A, not that of the workload. In addition, the noise added to the list of 
query answers is no longer independent, because the vector of independent Gaussian samples is transformed 
by the matrix WA+. 
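
A minimal sketch of the extended mechanism, assuming the noise vector has one component per strategy query (length p, as the dimensions of $WA^+$ require). The function names, the example workload and strategy, and the parameter values are introduced here for illustration only.

```python
import numpy as np

def l2_sensitivity(A):
    """Maximum L2 norm over the columns of A (Definition 2.8)."""
    return np.linalg.norm(A, axis=0).max()

def matrix_mechanism(W, A, x, eps, delta, rng):
    """Sketch of Prop. 3.1: W x + W A^+ b, with b Gaussian of length p."""
    A_pinv = np.linalg.pinv(A)
    if not np.allclose(W @ A_pinv @ A, W):
        raise ValueError("strategy A cannot represent every query in W")
    sigma = l2_sensitivity(A) * np.sqrt(2 * np.log(2 / delta)) / eps
    b = rng.normal(0.0, sigma, size=A.shape[0])   # noise calibrated to A, not W
    return W @ x + W @ A_pinv @ b

W = np.array([[1, 1, 1, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)   # rank 2, yet W A^+ A = W holds
x = np.array([3, 0, 2, 5], dtype=float)     # hypothetical cell counts

rng = np.random.default_rng(0)
noisy = matrix_mechanism(W, A, x, eps=0.5, delta=1e-4, rng=rng)
print(noisy, np.isclose(noisy[0], noisy[1] + noisy[2]))
```

In this example the first workload query is the sum of the other two, and the noisy answers satisfy the same relation because all three are linear functions of the same noisy strategy answers. This is one concrete sense in which the added noise is correlated rather than independent.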

The matrix mechanism can reduce error, particularly for large or complex workloads, by avoiding
redundancy in the set of desired workload queries. Intuitively, some workloads consist of queries that ask the
same (or similar) questions of the database multiple times, which incurs a significant cost to the privacy
budget under the Gaussian mechanism. By choosing the right strategy matrix for a workload, it is possible
to remove the redundancy from the queries submitted to the privacy mechanism and derive more accurate
answers to the workload queries.

Fundamental to the performance of the matrix mechanism is the choice of the strategy matrix which
instantiates it. One naive approach is to minimize sensitivity. The full rank strategy matrix with least
sensitivity is the identity matrix, I, which has sensitivity 1. With A = I, the matrix mechanism privately
computes the individual counts in x and then uses them to estimate any desired workload query. At the
other extreme, the workload itself can be used as the strategy, setting A = W. In this case, there is no
benefit in sensitivity over the Gaussian mechanism.^1

For many workloads, neither of these basic strategies offers optimal error. Recent research has shown
that for specific workloads, there exist strategies that can offer much better error rates. For example, if
W = AllRange(n), two strategies were recently proposed. A hierarchical strategy [13] includes the total
sum over the whole domain, the count of each half of the domain, and so on, terminating with counts of
individual elements of the domain. The wavelet strategy [22] consists of the matrix describing the Haar
wavelet transformation. Informally, both strategies achieve low error because they each have low sensitivity,
O(log n), and every range query can be expressed as a linear combination of just a few strategy queries.^2
Although both of these strategy matrices offer significant improvements in error for range query workloads,
neither is optimal. We will use our lower bound to evaluate the quality of these proposed strategies in Sec. 4.

3.2 Measuring and minimizing error 

We measure the error of individual query answers using mean squared error. For a workload of queries, the 
error is defined as the total of individual query errors. 

Definition 3.1 (Query and Workload Error). Let $\hat{w}$ be the estimate for query w under the matrix mechanism
using query strategy A. That is, $\hat{w} = \mathcal{M}_A(w, x)$. The mean squared error of the estimate for w using strategy
A is:

$$\mathrm{Error}_A(w) = E[(wx - \hat{w})^2].$$

Given a workload W, the total mean squared error of answering W using strategy A is:
$\mathrm{Error}_A(W) = \sum_{w_i \in W} \mathrm{Error}_A(w_i)$.

The query answers returned by the matrix mechanism are linear combinations of noisy strategy query 
answers to which independent Gaussian noise has been added. Thus, as the following proposition shows, we 
can directly compute the error for any linear query w or workload of queries W: 

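Proposition 3.2 (Total Error). Given a workload W, the total error of answering W using the extended
$(\epsilon, \delta)$ matrix mechanism with query strategy A is:

$$\mathrm{Error}_A(W) = P(\epsilon, \delta)\,\Delta_A^2\,\|WA^+\|_F^2 \qquad (1)$$

where $P(\epsilon, \delta) = \frac{2\ln(2/\delta)}{\epsilon^2}$.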

The proposition above extends the corresponding proposition in [15] and the proof is contained in
App. A.1. The optimal strategy for a workload W is defined to be one that minimizes total error.
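
The following short calculation indicates where the form of (1) comes from. It is a sketch, not the proof given in the appendix: it assumes the noise vector b (a name introduced here) has p independent zero-mean Gaussian components of variance $\sigma^2$, and it writes $w_i$ for the i-th row of W.

```latex
\begin{align*}
\mathcal{M}_A(W,x) - Wx &= WA^{+}b, \qquad b \sim \mathrm{Normal}(\sigma)^{p}, \qquad
  \sigma^{2} = \frac{2\ln(2/\delta)}{\epsilon^{2}}\,\Delta_A^{2},\\
\mathrm{Error}_A(w_i) &= E\big[(w_iA^{+}b)^{2}\big] = \sigma^{2}\,\|w_iA^{+}\|_2^{2},\\
\mathrm{Error}_A(W) &= \sum_{i}\sigma^{2}\,\|w_iA^{+}\|_2^{2} = \sigma^{2}\,\|WA^{+}\|_F^{2}.
\end{align*}
```

The second line uses the independence of the components of b, and the last step uses the fact that the squared Frobenius norm is the sum of the squared row norms.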

Definition 3.2. (Minimum Total Error) Given a workload W, the minimum total error is:

$$\mathrm{MinError}(W) = \min_{A:\, WA^+A = W} \mathrm{Error}_A(W). \qquad (2)$$
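
Equation (1) makes it straightforward to compare candidate strategies numerically. The sketch below evaluates three strategies for AllRange(16): the identity, the workload itself, and a binary hierarchical strategy in the spirit of [13]. The helper names, the domain size, and the privacy parameters are assumptions made for the example, and the hierarchical construction is a simplified version written for domain sizes that are powers of two.

```python
import numpy as np

def all_range(d):
    """All one-dimensional range queries over d ordered cells."""
    rows = []
    for i in range(d):
        for j in range(i, d):
            q = np.zeros(d)
            q[i:j + 1] = 1
            rows.append(q)
    return np.array(rows)

def hierarchical(n):
    """Total, halves, quarters, ..., single cells (n a power of two)."""
    rows, width = [], n
    while width >= 1:
        for start in range(0, n, width):
            q = np.zeros(n)
            q[start:start + width] = 1
            rows.append(q)
        width //= 2
    return np.array(rows)

def total_error(W, A, eps, delta):
    """Equation (1): P(eps, delta) * Delta_A^2 * ||W A^+||_F^2."""
    A_pinv = np.linalg.pinv(A)
    if not np.allclose(W @ A_pinv @ A, W):
        raise ValueError("strategy A cannot represent every query in W")
    P = 2 * np.log(2 / delta) / eps ** 2
    delta_A = np.linalg.norm(A, axis=0).max()
    return P * delta_A ** 2 * np.linalg.norm(W @ A_pinv, "fro") ** 2

n = 16
W = all_range(n)
errors = {
    "identity": total_error(W, np.eye(n), 1.0, 1e-4),
    "workload": total_error(W, W, 1.0, 1e-4),
    "hierarchical": total_error(W, hierarchical(n), 1.0, 1e-4),
}
for name, err in errors.items():
    print(f"{name:>12}: {err:.1f}")
```

The smallest value found over any finite list of candidates is only an upper bound on MinError(W), since the minimization in (2) ranges over all strategies that can represent W.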

Our work investigates the error complexity of workloads as it is represented by MinError(W). This
reflects the hardness of the workload assuming that algorithms are constructed from the matrix mechanism
and a total squared error measure is used. Understanding error for this class of mechanisms is a first step
towards more general lower bounds and helps to assess the quality of a number of previously-proposed
algorithms included in this class [15, 22, 1, 5].

^1 Although there is no benefit in sensitivity when A = W, the matrix mechanism still has lower error than the Gaussian
mechanism for some workloads by combining related query answers into a more accurate consistent result.

^2 The approaches in [13, 22] were originally proposed in the context of $\epsilon$-differential privacy, but their behavior is similar
under $(\epsilon, \delta)$-differential privacy.

Nevertheless, our results do not preclude the existence of different mechanisms capable of answering
workload W with lower error. In particular, our error analysis does not include data-dependent
algorithms [20, 11, 18, 4]. The error rates for these mechanisms no longer reflect properties of the workload
alone, but instead some combination of the workload and properties of the input data, and call for a
substantially different analysis. In our case, x (i.e., the vector of cell counts corresponding to the database)
does not appear in (1) above. This means that our minimum error strategy depends on the workload alone,
independent of a particular database instance.

The strategy matrix that minimizes total error can be computed using a semi-definite program (SDP)
[15]. However, finding the solutions of the program with standard SDP solvers takes $O(n^8)$ time, where
n is the number of cell conditions, making it infeasible for realistic applications. Efficient approximation
algorithms for this problem are an important open problem that is currently under investigation [16]. Yet
approximate, or even exact, solutions to this problem do not provide much general insight into the main goal
of this paper: to understand the properties of workloads that determine the magnitude of MinError(W).
The bound we will present in Sec. 4 is important because it closely approximates MinError(W), it is
easily computable, and it reveals the connection between minimum error and spectral properties.

3.3 Equivalence and containment for workloads 

Next we develop a notion of equivalence and containment of workloads with respect to error. We will verify 
that the error bounds presented in the next section satisfy these relationships in most cases. 

The special form of the expression for total error in Prop. 3.2 means that there are many workloads that
are equivalent from the standpoint of error. If $W_1^T W_1 = W_2^T W_2$, any strategy A that can represent the
queries of $W_1$ can also represent the queries of $W_2$, and vice versa. In addition, $W_1^T W_1 = W_2^T W_2$ also
implies $\|W_1 A^+\|_F^2 = \|W_2 A^+\|_F^2$ for any strategy A. We therefore define the following notion of equivalence
of two workloads:

Definition 3.3 (Workload Equivalence). An $m_1 \times n$ workload $W_1$ and an $m_2 \times n$ workload $W_2$ are
equivalent, denoted $W_1 \equiv W_2$, if $W_1^T W_1 = W_2^T W_2$.

The following conditions on pairs of workloads imply that they have equivalent minimum error: 

Proposition 3.3 (Equivalence Conditions). Given an $m_1 \times n_1$ workload $W_1$ and an $m_2 \times n_2$ workload $W_2$,
the following conditions imply that MinError($W_1$) = MinError($W_2$):

(i) $W_1 \equiv W_2$.

(ii) $W_1 = QW_2$ for some orthogonal matrix Q.

(iii) $W_2$ results from permuting the rows of $W_1$.

(iv) $W_2$ results from permuting the columns of $W_1$.

(v) $W_2$ results from the column projection of $W_1$ on all of its nonzero columns.

It follows from this proposition that MinError is row and column representation independent, and 
behaves well under the projection of extraneous columns. 
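
Conditions (ii) and (iii) can be checked numerically, since both leave the matrix $W^T W$ unchanged. The sketch below is only an illustration: the 6 x 4 workload, the random seed, and all variable names are made up for the purpose.

```python
import numpy as np

rng = np.random.default_rng(0)
W2 = rng.integers(0, 2, size=(6, 4)).astype(float)  # hypothetical 6 x 4 workload

# Condition (ii): left-multiply by an orthogonal Q (taken from a QR factorization).
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
W_rot = Q @ W2

# Condition (iii): permute the rows.
W_perm = W2[rng.permutation(6)]

gram = W2.T @ W2
same_rot = np.allclose(W_rot.T @ W_rot, gram)
same_perm = np.allclose(W_perm.T @ W_perm, gram)
print(same_rot, same_perm)
```

Both comparisons succeed because $(QW)^T(QW) = W^T Q^T Q W = W^T W$ and because a row permutation is itself an orthogonal transformation.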

Defining a notion of containment for workload matrices is more complex than simple inclusion of rows. 
Even if the rows of $W_1$ are not present in $W_2$, it could be that $W_1$ is in fact contained in $W_2$ when expressed
using an alternate basis. The following definition considers this possibility:

Definition 3.4 (Workload Containment). An $m_1 \times n$ workload $W_1$ is contained in an $m_2 \times n$ workload
$W_2$, denoted $W_1 \subseteq W_2$, if there exists a $W_2' \equiv W_2$ such that the rows of $W_1$ are contained in $W_2'$.

The following proposition shows two conditions which imply inequality of error among workloads:

Proposition 3.4 (Error inequality). Given an $m_1 \times n_1$ workload $W_1$ and an $m_2 \times n_2$ workload $W_2$, the
following conditions imply that MinError($W_1$) $\le$ MinError($W_2$):

(i) $W_1 \subseteq W_2$.

(ii) $W_1$ is a column projection of $W_2$.

App. A.2 contains the proofs of Prop. 3.3 and 3.4.

4 The Singular Value Bound 

In this section we state and prove our main result: a novel lower bound on MinError(W), the optimal
error of a workload W under the extended (ε, δ)-matrix mechanism. The bound shows that the hardness of
a workload is a function of its singular values (equivalently, of the eigenvalues of $W^T W$). We describe the
measure and its properties in Section 4.1 and prove that it is a lower bound in Section 4.2. In Section 4.3
we briefly discuss the challenge of adapting this bound to ε-differential privacy.

Throughout this section, we use the singular value decomposition of a matrix, which is a classic tool of
matrix analysis. If W is an m x n matrix, the singular value decomposition (SVD) of W is a factorization
of the form $W = Q_W \Lambda_W P_W^T$ such that $Q_W$ is an m x m orthogonal matrix, $\Lambda_W$ is an m x n diagonal
matrix containing the singular values of W, and $P_W$ is an n x n orthogonal matrix. When m > n, the
diagonal matrix $\Lambda_W$ consists of an n x n diagonal submatrix stacked on the zero block $0^{(m-n) \times n}$. In addition,
we also consider the eigenvalue decomposition of the matrix $W^T W$, whose form is $W^T W = P_W D_W P_W^T$, where $P_W$
is the same matrix as in the singular value decomposition of W and $D_W$ is an n x n diagonal matrix such that
$D_W = \Lambda_W^T \Lambda_W$.

4.1 The Singular Value Bound 

We first present the simplest form of our bound, which is based on computing the square of the sum of
the singular values of the workload matrix:

Definition 4.1 (Singular Value Bound). Given an m x n workload W, its singular value bound, denoted
as SVDB(W), is:

$$\mathrm{SVDB}(W) = \frac{1}{n}(\lambda_1 + \cdots + \lambda_n)^2,$$

where $\lambda_1, \ldots, \lambda_n$ are the singular values of W.

The following theorem guarantees that the singular value bound is a valid lower bound to the minimal 
error of a workload. The proof is presented in detail in Sec. 4.2. 

Theorem 4.1. Given an m x n workload W,

$$\mathrm{MinError}(W) \ge P(\epsilon, \delta)\,\mathrm{SVDB}(W),$$

where $P(\epsilon, \delta) = \frac{2\log(2/\delta)}{\epsilon^2}$.

In the rest of the paper, we refer to SVDB(W) as the "SVD bound". For any workload W, the SVD bound
is determined by $W^T W$ and can be computed directly from it (which can be more efficient):

Proposition 4.1. Given the n x n matrix $W^T W$,

$$\mathrm{SVDB}(W) = \frac{1}{n}\Big(\sum_{i=1}^{n} \sqrt{d_i}\Big)^2,$$

where $d_1, \ldots, d_n$ are the eigenvalues of $W^T W$.
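
As a rough illustration of Def. 4.1 and Prop. 4.1, the following sketch computes the bound once from the singular values of W and once from the eigenvalues of $W^T W$. The function names and the random 50 x 8 workload are hypothetical choices made for this example only.

```python
import numpy as np

def svdb_from_workload(W):
    """(1/n) * (sum of singular values of W)^2, as in Def. 4.1."""
    n = W.shape[1]
    return np.linalg.svd(W, compute_uv=False).sum() ** 2 / n

def svdb_from_gram(G):
    """(1/n) * (sum of square roots of the eigenvalues of G = W^T W)^2."""
    n = G.shape[0]
    d = np.clip(np.linalg.eigvalsh(G), 0.0, None)  # clip tiny negative round-off
    return np.sqrt(d).sum() ** 2 / n

rng = np.random.default_rng(1)
W = rng.standard_normal((50, 8))      # hypothetical workload, m = 50, n = 8
a = svdb_from_workload(W)
b = svdb_from_gram(W.T @ W)
print(a, b)
```

The second route never needs W itself, which matters when the number of queries m is much larger than the number of cells n.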

The SVD bound satisfies equivalence properties analogous to (i), (ii), (iii), and (iv) in Prop. 3.3 and
inequality (i) in Prop. 3.4. However, it does not satisfy properties related to column projection, as shown in
the following counter-example.

Example 4.1. Consider a 2 x n workload W consisting of the queries [1, 0, . . . , 0] and [t, t, . . . , t]. Let μ be the
column projection w.r.t. the first cell condition of W. When n > 8 and t < 1/8, SVDB(W) < SVDB(μ(W)).
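
The counter-example can be reproduced numerically. The sketch below picks n = 16 and t = 0.1 as illustrative values within the stated range (these specific values are not from the text) and compares the bound on W with the bound on its projection onto the first column.

```python
import numpy as np

def svdb(W):
    # (1/n) * (sum of singular values)^2, with n the number of columns of W
    return np.linalg.svd(W, compute_uv=False).sum() ** 2 / W.shape[1]

n, t = 16, 0.1                                  # illustrative: n > 8, t < 1/8
W = np.vstack([np.eye(n)[0], np.full(n, t)])    # queries [1,0,...,0] and [t,...,t]
projected = W[:, :1]                            # keep only the first cell condition

full_bound = svdb(W)
proj_bound = svdb(projected)
print(full_bound, proj_bound)
```

For this workload the two quantities have closed forms, $(1 + nt^2 + 2t\sqrt{n-1})/n$ and $1 + t^2$, which makes the gap easy to see: the first shrinks roughly like 1/n while the second stays near 1.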

According to Prop. 3.4, column projections reduce the minimum error. Therefore, the SVD bound on
any column projection of W also constitutes a lower bound for the minimum error of W. Because of this
we extend the simple SVD bound in the following way. Recall that $\mathcal{U}_n$ is the set of all column projections.

Definition 4.2. Given an m x n workload W and $U \subseteq \mathcal{U}_n$, the singular value bound of W w.r.t. U,
denoted by $\mathrm{SVDB}_U(W)$, is defined as

$$\mathrm{SVDB}_U(W) = \max_{\mu \in U} \mathrm{SVDB}(\mu(W)).$$

In particular, if $U = \mathcal{U}_n$, we call this bound the supreme singular value bound, denoted $\overline{\mathrm{SVDB}}(W)$.

According to Prop. 3.4 and Thm. 4.1, for any $U \subseteq \mathcal{U}_n$, $\mathrm{SVDB}_U(W)$ provides a lower bound on MinError(W).

Corollary 4.1. Given an m x n workload W, for any $U \subseteq \mathcal{U}_n$,

$$\mathrm{MinError}(W) \ge \max_{\mu \in U} \mathrm{MinError}(\mu(W)) \ge P(\epsilon, \delta)\,\mathrm{SVDB}_U(W),$$

where $P(\epsilon, \delta) = \frac{2\log(2/\delta)}{\epsilon^2}$.
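
A brute-force sketch of Def. 4.2 follows, assuming that a column projection simply keeps a subset of the columns of W. The helper names are hypothetical, and enumerating all $2^n - 1$ projections is only feasible for very small n; it is shown here purely to make the definition concrete.

```python
import numpy as np
from itertools import combinations

def svdb(W):
    return np.linalg.svd(W, compute_uv=False).sum() ** 2 / W.shape[1]

def svdb_over(W, projections):
    """Max of the simple bound over the given column subsets (Def. 4.2)."""
    return max(svdb(W[:, list(cols)]) for cols in projections)

def contiguous_ranges(n):
    # projections onto every contiguous range of cells
    return [range(i, j) for i in range(n) for j in range(i + 1, n + 1)]

def all_projections(n):
    # every non-empty subset of columns: the supreme bound for tiny n
    return [c for k in range(1, n + 1) for c in combinations(range(n), k)]

n, t = 8, 0.1
W = np.vstack([np.eye(n)[0], np.full(n, t)])    # workload of Example 4.1
simple = svdb(W)
over_ranges = svdb_over(W, contiguous_ranges(n))
supreme = svdb_over(W, all_projections(n))
print(simple, over_ranges, supreme)
```

Restricting U to contiguous ranges reduces the number of projections from exponential to quadratic in n, which is the kind of compromise the text describes.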

The supreme SVD bound satisfies all of the error equivalence and containment properties, analogous to
those of Prop. 3.3 and Prop. 3.4. Both of the following theorems are proved in App. A.3.

Theorem 4.2. Given an $m_1 \times n_1$ workload $W_1$ and an $m_2 \times n_2$ workload $W_2$, the following conditions
imply that $\overline{\mathrm{SVDB}}(W_1) = \overline{\mathrm{SVDB}}(W_2)$:

(i) $W_1 \equiv W_2$.

(ii) $W_1 = QW_2$ for some orthogonal matrix Q.

(iii) $W_2$ results from permuting the rows of $W_1$.

(iv) $W_2$ results from permuting the columns of $W_1$.

(v) $W_2$ results from the column projection of $W_1$ on all of its nonzero columns.

Theorem 4.3. Given an $m_1 \times n_1$ workload $W_1$ and an $m_2 \times n_2$ workload $W_2$, the following conditions
imply that $\overline{\mathrm{SVDB}}(W_1) \le \overline{\mathrm{SVDB}}(W_2)$:

(i) $W_1 \subseteq W_2$.

(ii) $W_1$ is a column projection of $W_2$.

While Theorems 4.2 and 4.3 show that $\overline{\mathrm{SVDB}}(W)$ matches all the properties of MinError(W), we often
wish to avoid considering all possible column projections as required in the computation of $\overline{\mathrm{SVDB}}(W)$. In
many cases, using the simple bound SVDB(W) as our lower bound provides good results. In other cases, we can choose an
appropriate set of column projections to get a good approximation to the supreme SVD bound. We provide
empirical evidence for this in the following example, along with an application of our bound to range and
predicate workloads which have been studied in prior work. The bound allows us to evaluate, for the first
time, how well existing solutions approximate the minimum achievable error.

Example 4.2. In Table 2 we consider three range workloads on different dimensions, along with the workload
of all predicate queries. We report SVDB(W) and the ratio of $\mathrm{SVDB}_U(W)$ to it, where U contains projections
onto all possible ranges over the domain, showing that the two are virtually indistinguishable.

We also compute the actual error introduced by several well-known strategies: the identity strategy, the
hierarchical strategy [13], and the wavelet strategy [22], as well as a strategy generated by a heuristic search
algorithm proposed elsewhere [16]. These results reveal the quality of these approaches by their ratio to
SVDB(W). For example, from the table we can conclude that the hierarchical and wavelet strategies have
error at most 1.5 to 3 times the optimal for range workloads, but perform worse on the predicate queries.
The identity strategy is far from optimal on low dimensional range queries, but better on high dimensional 
range queries and predicate queries.

Example Workload, W                      SVDB(W)          SVDB_U(W) /    Error, as ratio to P(ε, δ) SVDB(W)
                                                          SVDB(W)        Identity  Hierarchical  Wavelet  Heuristic

AllRange(1024)                           6.401 x 10^6     1.001          28.04     1.776         1.529    1.264
AllRange(32, 32)                         4.391 x 10^6     1.000          8.154     2.923         1.819    1.079
AllRange(2, 2, 2, 2, 2, 2, 2, 2, 2, 2)   5.242 x 10^5     1.000          2.000     2.000         2.000    1.009
AllPredicate(1024)                       4.885 x 10^310   1.000          1.884     3.464         6.292

Table 2: Four example workloads, their singular value bounds, and their error rates under common strategies
and strategies proposed in prior work.
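
The first row of Table 2 can be approximated with a short computation. The sketch below assumes that AllRange(n) contains every contiguous range over n cells, so that entry (i, j) of $W^T W$ counts the ranges covering both cells, and that the identity strategy has error $P(\epsilon, \delta)\,\mathrm{trace}(W^T W)$ (sensitivity 1). The function names are hypothetical.

```python
import numpy as np

def allrange_gram(n):
    """W^T W for all 1-D range queries over n cells, without building W.
    Entry (i, j), 0-indexed, is (min(i, j) + 1) * (n - max(i, j))."""
    idx = np.arange(n)
    lo = np.minimum.outer(idx, idx)
    hi = np.maximum.outer(idx, idx)
    return (lo + 1.0) * (n - hi)

def svdb_from_gram(G):
    d = np.clip(np.linalg.eigvalsh(G), 0.0, None)
    return np.sqrt(d).sum() ** 2 / G.shape[0]

G = allrange_gram(1024)
bound = svdb_from_gram(G)
identity_ratio = np.trace(G) / bound   # identity-strategy error over the bound
print(bound, identity_ratio)
```

Under these assumptions the two printed values should land close to the SVDB(W) and Identity entries of the AllRange(1024) row.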



4.2 Proof of the SVD bound 

We now show the proof of Theorem 4.1. The key to the proof is an important property of the optimal strategy
for the (ε, δ)-matrix mechanism. As shown in Lemma 4.1, among the optimal strategies for a workload W,
there is always a strategy A that has the same sensitivity for every cell condition (i.e., in every column). We
use $\mathcal{A}_W$ to denote the set that contains all such strategies A with $WA^+A = W$.

Recall that the sensitivity of strategy A (Def. 2.8) is the maximum $L_2$ column norm of A. The square
of the sensitivity is also equal to the maximum diagonal entry of $A^T A$. By using Lemma 4.1, the sensitivity
of A can instead be computed in terms of the trace of the matrix $A^T A$, and minimizing the error of W
with this alternative expression of the sensitivity leads to the SVD bounds. Ultimately, to achieve the SVD
bounds, a strategy A must simultaneously (i) minimize the error of W with sensitivity computed in terms
of $\mathrm{trace}(A^T A)$, and (ii) have $A \in \mathcal{A}_W$. Such a strategy may not exist for every possible W and therefore
the SVD bounds only serve as lower bounds to the minimal error of W.

Lemma 4.1. Given a workload W, there exists a strategy $A \in \mathcal{A}_W$ such that $\mathrm{Error}_A(W) = \mathrm{MinError}(W)$.

Lemma 4.2. Let D be a diagonal matrix with non-negative diagonal entries and P be an orthogonal matrix
whose columns are $p_1, p_2, \ldots, p_n$. Then

$$\mathrm{trace}(D) \le \sum_{i=1}^{n} \|D p_i\|_2.$$
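
A quick numerical sanity check of Lemma 4.2, using a random diagonal D and a random orthogonal P obtained from a QR factorization. The size n = 6 and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
D = np.diag(rng.uniform(0.0, 5.0, size=n))         # non-negative diagonal matrix
P, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix

lhs = np.trace(D)
rhs = np.linalg.norm(D @ P, axis=0).sum()          # sum over i of ||D p_i||_2
rhs_identity = np.linalg.norm(D, axis=0).sum()     # same sum with P = I
print(lhs, rhs, rhs_identity)
```

With P equal to the identity the two sides coincide, which is the equality case used later in the proof of Theorem 4.1.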



The proofs of both lemmas are in App. A.3. Now we can prove Theorem 4.1.

Proof. For a given workload W, according to Lemma 4.1, it has an optimal strategy matrix $A \in \mathcal{A}_W$, whose
squared sensitivity can then be computed as $\Delta_A^2 = \frac{1}{n}\|A\|_F^2$.

Let $W = Q_W \Lambda_W P_W^T$ and $A = Q_A \Lambda_A P_A^T$ be the singular value decompositions of W and A, respectively. We
have:

$$\begin{aligned}
\min_{A:\, WA^+A = W} \Delta_A^2\, \|WA^+\|_F^2
&= \min_{A \in \mathcal{A}_W} \frac{1}{n}\, \|A\|_F^2\, \|WA^+\|_F^2 \\
&= \frac{1}{n} \min_{(\Lambda_A P_A^T) \in \mathcal{A}_W} \|\Lambda_A\|_F^2\, \|\Lambda_W P_W^T P_A \Lambda_A^+\|_F^2 \\
&\ge \frac{1}{n} \min_{\Lambda_A, P_A:\ \Lambda_W \Lambda_A^+ \Lambda_A = \Lambda_W} \|\Lambda_A\|_F^2\, \|\Lambda_W P_W^T P_A \Lambda_A^+\|_F^2 \qquad (3) \\
&\ge \frac{1}{n} \min_{P_A} \Big(\sum_{i=1}^{n} \|\Lambda_W p_i\|_2\Big)^2 \qquad (4) \\
&\ge \frac{1}{n} \Big(\sum_{i=1}^{n} \lambda_i\Big)^2, \qquad (5)
\end{aligned}$$

where $p_i$ is the i-th column of the matrix $P_W^T P_A$; the inequality in (4) is based on the Cauchy-Schwarz inequality
and the inequality in (5) comes from Lemma 4.2.

Equality in (4) holds if and only if $\Lambda_A \propto \sqrt{\Lambda_W}$. Therefore, to achieve equality in (4) and (5)
simultaneously, we need $A \propto Q\sqrt{\Lambda_W}\,P_W^T$ for any orthogonal matrix Q. Moreover, equality in (3) holds if and only if
$A \in \mathcal{A}_W$, which may not be satisfied when $A \propto Q\sqrt{\Lambda_W}\,P_W^T$; therefore the SVD bound only gives a lower
bound to the minimum total error. □

Intuitively, the SVD bound is based on the assumption that the error can be evenly distributed to all the 
cells, which may not be achievable in all the cases. The supreme SVD bound considers only the case that 
the error can be evenly distributed to some of the cells and therefore may be tighter than the SVD bound. 
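
The gap described above can be observed directly. The sketch below builds the strategy $A = \sqrt{\Lambda_W}\,P_W^T$ from the proof and compares two versions of its error, both divided by $P(\epsilon, \delta)$: one using the true squared sensitivity (largest squared column norm of A) and one using the averaged estimate $\mathrm{trace}(A^T A)/n$. Function names and the random workload are illustrative assumptions.

```python
import numpy as np

def proof_strategy(W):
    """A = sqrt(Lambda_W) P_W^T, built from the thin SVD of W."""
    _, s, Pt = np.linalg.svd(W, full_matrices=False)
    return np.sqrt(s)[:, None] * Pt

def error_terms(W, A):
    """Return (true, estimated) error divided by P(eps, delta)."""
    col_sq = (A ** 2).sum(axis=0)                   # squared L2 column norms
    fro_sq = np.linalg.norm(W @ np.linalg.pinv(A), "fro") ** 2
    return col_sq.max() * fro_sq, col_sq.mean() * fro_sq

rng = np.random.default_rng(3)
W = rng.standard_normal((20, 5))                    # hypothetical workload
A = proof_strategy(W)
true_err, est_err = error_terms(W, A)
bound = np.linalg.svd(W, compute_uv=False).sum() ** 2 / W.shape[1]
print(true_err, est_err, bound)
```

The estimated error equals the SVD bound, while the true error is at least as large; the two agree only when all columns of A have the same norm.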

4.3 Bounding MinError(W) Under the e-Matrix Mechanism 

The SVD bound is defined for the (ε, δ)-matrix mechanism, so it is natural to consider extending these results
to the ε-matrix mechanism. Prop. 3.2 can be adapted to the ε-matrix mechanism by using an alternative
privacy parameter, $P(\epsilon) = 2/\epsilon^2$, and measuring the sensitivity of A as the largest $L_1$ norm of the columns
of A. For any vector, its $L_1$ norm is always greater than or equal to its $L_2$ norm. Hence, given a workload W
and a strategy matrix A, the quantity $P(\epsilon)\,\Delta_A^2\,\|WA^+\|_F^2$, with $\Delta_A$ the $L_2$ sensitivity of A, provides a lower
bound to $\mathrm{Error}_A(W)$ under the ε-matrix mechanism. Therefore error under the ε-matrix mechanism is also
bounded below by $P(\epsilon)\,\mathrm{SVDB}(W)$.

Although the SVD bound works under the ε-matrix mechanism, the quality of this bound is not yet
known. The discussion on the tightness and looseness of the SVD bound in the next section is based on the
(ε, δ)-matrix mechanism and cannot be extended to the ε-matrix mechanism directly.

5 Analysis of the SVD Bound 

In this section, we analyze the accuracy of the SVD bound as an approximation of the minimum error for 
a workload. We show first that the bound is tight by characterizing a class of workloads for which the 
minimum error is equal to the bound. Further, for this class of workloads we can directly construct an 
optimal strategy achieving minimal error. 

Then we show that the bound may be loose, underestimating the minimal error for some workloads. The 
worst case of looseness of the SVD bound is presented in Section 5.2 along with a formal estimate of the 
quality of the bound. We conclude this section with an example demonstrating empirically that error rates 
close to the lower bound can be achieved for workloads consisting of multi-dimensional range queries. 

5.1 The Tightness of the SVD Bound 

To demonstrate that the SVD bound is a tight lower bound, we define a special class of workloads called
variable agnostic workloads in which the queries on each cell are fully symmetric and swapping any two cells
does not change $W^T W$.

Definition 5.1 (Variable agnostic workload).
A workload W is variable agnostic if $W^T W$ is unchanged when we swap any two columns of W.

For any variable agnostic workload W, $W^T W$ has the following special form: for some constants a and
b such that a > b, all diagonal entries of $W^T W$ are equal to a and the remaining entries of $W^T W$ are equal
to b.

The following theorem shows how to construct a strategy for any variable agnostic workload W, having
total error that attains the SVD bound. The proof is in App. A.4.

Theorem 5.1. For a positive integer k and $n = 2^k$, let W be any m x n variable agnostic workload with
singular value decomposition $W = Q_W \Lambda_W P_W^T$. The strategy $A = \sqrt{\Lambda_W}\,P_W^T$ attains MinError(W) and

$$\mathrm{MinError}(W) = P(\epsilon, \delta)\,\mathrm{SVDB}(W) = P(\epsilon, \delta)\,\frac{1}{n}\Big(\sqrt{a + (n-1)b} + (n-1)\sqrt{a - b}\Big)^2.$$

As a concrete example, the workload AllPredicate(n) is variable agnostic, and therefore we can construct
its optimal strategy and compute the error rate directly.

Corollary 5.1. For $n = 2^k$ and W = AllPredicate(n), let the singular value decomposition of W
be $W = Q_W \Lambda_W P_W^T$. The strategy $A = \sqrt{\Lambda_W}\,P_W^T$ attains MinError(W) and

$$\mathrm{MinError}(W) = P(\epsilon, \delta)\,\mathrm{SVDB}(W) = P(\epsilon, \delta)\,\frac{2^{n-2}}{n}\big(n - 1 + \sqrt{n + 1}\big)^2.$$
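
For a small case the corollary can be checked by brute force. The sketch below assumes that AllPredicate(n) contains all $2^n$ 0/1 query vectors over n cells (including the all-zero query) and uses n = 4; errors are reported divided by $P(\epsilon, \delta)$.

```python
import numpy as np
from itertools import product

n = 4
W = np.array(list(product([0.0, 1.0], repeat=n)))   # all 2^n predicate queries

G = W.T @ W
a, b = G[0, 0], G[0, 1]                              # diagonal / off-diagonal entries

_, s, Pt = np.linalg.svd(W, full_matrices=False)
bound = s.sum() ** 2 / n                             # SVDB(W)
closed_form = 2 ** (n - 2) / n * (n - 1 + np.sqrt(n + 1)) ** 2

A = np.sqrt(s)[:, None] * Pt                         # strategy sqrt(Lambda_W) P_W^T
sens_sq = (A ** 2).sum(axis=0).max()                 # squared L2 sensitivity of A
achieved = sens_sq * np.linalg.norm(W @ np.linalg.pinv(A), "fro") ** 2
print(a, b, bound, closed_form, achieved)
```

Here $W^T W$ has $a = 2^{n-1}$ on the diagonal and $b = 2^{n-2}$ elsewhere, every column of A has the same norm, and the achieved error coincides with the bound.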

For variable agnostic workloads, using naive strategies consisting of the identity matrix or the workload
itself results in total error equal to na, and the ratio by which the error is reduced using the strategy in
Thm. 5.1 is approximately $1 - \frac{b}{a}$. In the case of AllPredicate(n), where b = a/2, the ratio approaches 0.5
as n becomes very large.

These results prove formally an empirical observation that has emerged from prior work [15, 22], namely
that error is unavoidably high for the AllPredicate workload. There are exponentially many queries in
the workload, and the mechanism cannot significantly exploit the correlation between queries to reduce 
sensitivity because all possible queries are present in the workload. 

5.2 The Looseness of the SVD Bound 

The SVD bound can also underestimate the minimum error when the workload is highly skewed. For example,
the SVD bound does not work well when the sensitivity of one column in the workload is overwhelmingly
larger than others. Recall the workload in Example 4.1: when t → 0, the SVD bound will underestimate
the total error by a factor of n. This is caused by the underestimate of the sensitivity of A considered in
equation (3) in the proof of Thm. 4.1. The supreme SVD bound can help us to avoid these naive worst cases,
but there is no guarantee of its quality with more sophisticated cases.

Since the proof of Thm. 4.1 constructs a concrete strategy, one way to measure the looseness of the SVD
bound is to estimate its ratio to the actual error introduced by this strategy. Note that the sensitivity of
the strategy is the only part of the SVD bound that is underestimated. The square of the sensitivity is
the maximum diagonal entry of matrix $A^T A$, rather than the estimate given by $\mathrm{trace}(A^T A)/n$. The ratio
between the actual sensitivity and the estimated sensitivity bounds the looseness of the SVD bound. The
following theorem bounds this ratio using W and its SVD bound, which can also be applied to any column
projection of W so as to estimate the looseness of $\mathrm{SVDB}_U(W)$ for any $U \subseteq \mathcal{U}_n$.

Theorem 5.2. Given an m x n workload W,

$$\frac{P(\epsilon, \delta)\,\mathrm{SVDB}(W)}{\mathrm{MinError}(W)} \ge \frac{1}{\Delta_W}\sqrt{\frac{\mathrm{SVDB}(W)}{n}},$$

where $\Delta_W$ is the sensitivity of W and $P(\epsilon, \delta) = \frac{2\log(2/\delta)}{\epsilon^2}$.

The proof of this theorem is in App. A.4.
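
The inequality can be probed numerically under one assumption that should be stated plainly: $\Delta_W$ is read here as the $L_2$ sensitivity of W, i.e. its largest column norm. Since MinError(W) is at most the error of any particular strategy, it suffices to check the ratio for the strategy $A = \sqrt{\Lambda_W}\,P_W^T$ from the proof of Thm. 4.1. All names below are hypothetical.

```python
import numpy as np

def looseness_check(W):
    """Return (SVDB / error of the proof strategy, claimed lower bound)."""
    n = W.shape[1]
    _, s, Pt = np.linalg.svd(W, full_matrices=False)
    A = np.sqrt(s)[:, None] * Pt
    bound = s.sum() ** 2 / n
    # error of A divided by P(eps, delta): squared sensitivity times ||W A^+||_F^2
    err = (A ** 2).sum(axis=0).max() * np.linalg.norm(W @ np.linalg.pinv(A), "fro") ** 2
    delta_W = np.linalg.norm(W, axis=0).max()        # assumed meaning of Delta_W
    return bound / err, np.sqrt(bound / n) / delta_W

n, t = 32, 1e-3
skewed = np.vstack([np.eye(n)[0], np.full(n, t)])    # Example 4.1 with small t
random_W = np.random.default_rng(4).standard_normal((40, 10))

ratio_skewed, floor_skewed = looseness_check(skewed)
ratio_random, floor_random = looseness_check(random_W)
print(ratio_skewed, floor_skewed, ratio_random, floor_random)
```

For the skewed workload both numbers are close to 1/n, matching the worst case discussed above; for the random workload the ratio is much closer to 1.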

Despite the extreme case of skewed workloads, for many common workloads the SVD bound appears to
be quite close to the minimal error. The following example provides a comparison between the SVD bound 
and achievable error for a few common workloads. 

Example 5.1. Returning to Table 2, we observe empirical evidence that for range and predicate workloads, 
there are strategies that come quite close to the SVD bound. The last column of Table 2 lists the error for a 
heuristic algorithm [16] that attempts to find approximately optimal strategies for any given workload. This 
algorithm is able to find a strategy whose error is within a factor of 1.264 of optimal for AllRange(1024), 
and finds strategies that are virtually optimal for higher dimensional range queries. 

6 An Algebra for Workloads 

In this section we define basic operators of union, negation, and crossproduct on workloads and show how 
our error measure behaves in the presence of these operators. These results provide insight into workload 
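error complexity, and will be used for example applications in the next section.

Operator: $W_1 \cup W_2$
  SVDB:     $\sqrt{\mathrm{SVDB}(W_1)} + \sqrt{\mathrm{SVDB}(W_2)} \ge \sqrt{\mathrm{SVDB}(W_1 \cup W_2)}$
  Supreme:  $\sqrt{\overline{\mathrm{SVDB}}(W_1)} + \sqrt{\overline{\mathrm{SVDB}}(W_2)} \ge \sqrt{\overline{\mathrm{SVDB}}(W_1 \cup W_2)}$

Operator: $v - W$
  SVDB:     $\sqrt{\mathrm{SVDB}(v - W)} - \|v\|_2\sqrt{m} \le \sqrt{\mathrm{SVDB}(W)}$
  Supreme:  $\max_{\mu \in \mathcal{U}_n}\big(\sqrt{\mathrm{SVDB}(\mu(v - W))} - \|\mu(v)\|_2\sqrt{m}\big) \le \sqrt{\overline{\mathrm{SVDB}}(W)}$

Operator: $W_1 \times W_2$
  SVDB:     $\mathrm{SVDB}(W_1)\,\mathrm{SVDB}(W_2) = \mathrm{SVDB}(W_1 \times W_2)$
  Supreme:  $\overline{\mathrm{SVDB}}(W_1)\,\overline{\mathrm{SVDB}}(W_2) \le \overline{\mathrm{SVDB}}(W_1 \times W_2)$

Predicate Workloads

Operator: $\neg W$
  SVDB:     $\sqrt{\mathrm{SVDB}(\neg W)} - \sqrt{mn} \le \sqrt{\mathrm{SVDB}(W)}$
  Supreme:  $\max_{\mu \in \mathcal{U}_n}\big(\sqrt{\mathrm{SVDB}(\mu(\neg W))} - \|\mu(\mathbf{1})\|_2\sqrt{m}\big) \le \sqrt{\overline{\mathrm{SVDB}}(W)}$

Operator: $W_1 \wedge W_2$
  SVDB:     $\mathrm{SVDB}(W_1)\,\mathrm{SVDB}(W_2) = \mathrm{SVDB}(W_1 \wedge W_2)$
  Supreme:  $\overline{\mathrm{SVDB}}(W_1)\,\overline{\mathrm{SVDB}}(W_2) \le \overline{\mathrm{SVDB}}(W_1 \wedge W_2)$

Operator: $W_1 \vee W_2$
  SVDB:     $\sqrt{\mathrm{SVDB}(W_1 \vee W_2)} - \sqrt{m_1 m_2 n_1 n_2} \le \sqrt{\mathrm{SVDB}(\neg W_1)}\,\sqrt{\mathrm{SVDB}(\neg W_2)}$

Table 3: Algebra operators and relations for the simple and supreme singular value bounds.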



Since the results for the supreme SVD bound can be easily derived from the results for the SVD bound, 
we state all theorems in this section in terms of the SVD bound. The proofs of the theorems can be found 
in App. A. 5. Table 3 summarizes the results. 

6.1 Union and Negation 

The union operation on workloads has the standard meaning for rows of the workload matrix:

Definition 6.1 (Union). Given an $m_1 \times n$ workload $W_1$ and an $m_2 \times n$ workload $W_2$ over the same n cell
conditions, $W_1 \cup W_2$ is the union of $W_1$ and $W_2$: the workload consisting of the rows of both $W_1$ and $W_2$,
without duplicates.

The relationship between the SVD bounds of workloads and their unions can be bounded:

Theorem 6.1. Given an $m_1 \times n$ workload $W_1$ and an $m_2 \times n$ workload $W_2$ on the same set of n cell
conditions,

$$\sqrt{\mathrm{SVDB}(W_1)} + \sqrt{\mathrm{SVDB}(W_2)} \ge \sqrt{\mathrm{SVDB}(W_1 \cup W_2)}.$$
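
A numerical spot check of Theorem 6.1 on two random predicate workloads. The sizes, seed, and the use of `np.unique` to drop duplicate rows are choices made for this sketch.

```python
import numpy as np

def svdb(W):
    return np.linalg.svd(W, compute_uv=False).sum() ** 2 / W.shape[1]

rng = np.random.default_rng(5)
W1 = rng.integers(0, 2, size=(12, 6)).astype(float)
W2 = rng.integers(0, 2, size=(9, 6)).astype(float)

# Union: rows of both workloads, with duplicate rows removed.
W_union = np.unique(np.vstack([W1, W2]), axis=0)

lhs = np.sqrt(svdb(W1)) + np.sqrt(svdb(W2))
rhs = np.sqrt(svdb(W_union))
print(lhs, rhs)
```

The inequality reflects the triangle inequality for the sum of singular values, applied to the two workloads stacked on top of each other.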

When W consists of predicate queries, ¬W contains the negation of each query in W. This is a
special case of generalized negation defined for any linear query:

Definition 6.2 (Generalized Negation). Let W be an m x n workload of queries $w_1, \ldots, w_m$ and v be an
n-dimensional linear query vector. The negation of workload W w.r.t. v is denoted v − W and consists of
rows $v - w_1, \ldots, v - w_m$. In particular, if W consists of predicate queries and 1 is a vector of all 1s, then
1 − W is the entry-wise "not" operation on W, which we denote simply as ¬W.

The next theorem shows the relationship between the singular value bound of a workload and its gener- 
alized negation. 

Theorem 6.2. For an n-dimensional query v and an m x n workload W,

$$\sqrt{\mathrm{SVDB}(v - W)} - \|v\|_2\sqrt{m} \le \sqrt{\mathrm{SVDB}(W)}.$$

The negation of a workload is a special case of Theorem 6.2 with v = 1.

Corollary 6.1. For any m x n predicate workload W:

$$\sqrt{\mathrm{SVDB}(\neg W)} - \sqrt{mn} \le \sqrt{\mathrm{SVDB}(W)}.$$

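
Both inequalities can be spot-checked on a random predicate workload. In the sketch below the workload, the query vector v, and the seed are arbitrary; v − W is formed by broadcasting v against every row of W.

```python
import numpy as np

def svdb(W):
    return np.linalg.svd(W, compute_uv=False).sum() ** 2 / W.shape[1]

rng = np.random.default_rng(6)
m, n = 15, 7
W = rng.integers(0, 2, size=(m, n)).astype(float)   # predicate workload
v = rng.standard_normal(n)                          # arbitrary linear query

# Theorem 6.2: generalized negation v - W (rows v - w_i).
gen_lhs = np.sqrt(svdb(v[None, :] - W)) - np.linalg.norm(v) * np.sqrt(m)

# Corollary 6.1: ordinary negation, v = all-ones vector.
neg_lhs = np.sqrt(svdb(1.0 - W)) - np.sqrt(m * n)

rhs = np.sqrt(svdb(W))
print(gen_lhs, neg_lhs, rhs)
```
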
6.2 Workload combination 

Given two workloads over distinct sets of cell conditions, we can combine them to form a workload over the 
crossproduct of the individual cell conditions. This is most commonly used to combine workloads defined 
over distinct sets of attributes $B_1$ and $B_2$ to get a workload defined over $B_1 \cup B_2$. When we pair individual
predicate queries, we can pair them disjunctively or conjunctively. 



Definition 6.3 (Workload combination). Given an $m_1 \times n_1$ workload $W_1$ defined by cell conditions
$\Phi = \phi_1 \ldots \phi_{n_1}$ and an $m_2 \times n_2$ workload $W_2$ defined by distinct cell conditions $\Psi = \psi_1 \ldots \psi_{n_2}$, a new combined
workload W is defined over cell conditions $\{\phi_i \wedge \psi_j \mid \phi_i \in \Phi, \psi_j \in \Psi\}$. For each $w_1 = (w_{1,1}, \ldots, w_{1,n_1}) \in W_1$
and $w_2 = (w_{2,1}, \ldots, w_{2,n_2}) \in W_2$, there is a query $w \in W$ accordingly:

• (Crossproduct) If the entry of w related to each cell condition $\phi_i \wedge \psi_j$ is $w_{1,i} \cdot w_{2,j}$, W is called the
crossproduct of $W_1$ and $W_2$, denoted as $W_1 \times W_2$.

• (Conjunction) If both $W_1$ and $W_2$ consist of predicate queries and the entry of w related to each cell
condition $\phi_i \wedge \psi_j$ is $w_{1,i} \wedge w_{2,j}$, then W is called the conjunction of $W_1$ and $W_2$, denoted as $W_1 \wedge W_2$.

• (Disjunction) If both $W_1$ and $W_2$ consist of predicate queries and the entry of w related to each cell
condition $\phi_i \wedge \psi_j$ is $w_{1,i} \vee w_{2,j}$, then W is called the disjunction of $W_1$ and $W_2$, denoted as $W_1 \vee W_2$.

The next theorem describes the singular value bound for the crossproduct of workloads: 

Theorem 6.3. Given an $m_1 \times n_1$ workload $W_1$ and an $m_2 \times n_2$ workload $W_2$ defined on two distinct sets
of cell conditions: $\mathrm{SVDB}(W_1 \times W_2) = \mathrm{SVDB}(W_1)\,\mathrm{SVDB}(W_2)$.
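
In matrix terms the crossproduct of Def. 6.3 can be modeled by the Kronecker product, whose singular values are the pairwise products of those of its factors. The sketch below uses `np.kron` for the crossproduct and, for 0/1 workloads, for conjunction; the disjunction is formed as the negation of the conjunction of the negated workloads. Sizes and seed are illustrative.

```python
import numpy as np

def svdb(W):
    return np.linalg.svd(W, compute_uv=False).sum() ** 2 / W.shape[1]

rng = np.random.default_rng(8)
W1 = rng.integers(0, 2, size=(5, 3)).astype(float)   # m1 x n1 predicate workload
W2 = rng.integers(0, 2, size=(6, 4)).astype(float)   # m2 x n2 predicate workload

W_cross = np.kron(W1, W2)                 # entries w_{1,i} * w_{2,j}; equals AND for 0/1
W_or = 1.0 - np.kron(1.0 - W1, 1.0 - W2)  # entries w_{1,i} OR w_{2,j}

product_rule = (svdb(W_cross), svdb(W1) * svdb(W2))
or_lhs = np.sqrt(svdb(W_or)) - np.sqrt(W_or.size)    # W_or.size = m1*m2*n1*n2
or_rhs = np.sqrt(svdb(1.0 - W1)) * np.sqrt(svdb(1.0 - W2))
print(product_rule, or_lhs, or_rhs)
```

The first pair illustrates the crossproduct rule; the last two values illustrate the disjunction inequality that follows from combining the crossproduct rule with Corollary 6.1.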

The conjunction of predicate queries is a special case of crossproduct, and disjunction of two workloads 
can be considered as the negation of conjunction of the negated workloads. Thus, applying Corollary 6.1 
and Theorem 6.3 to predicate workloads we have: 

Corollary 6.2. Given an $m_1 \times n_1$ workload $W_1$ and an $m_2 \times n_2$ workload $W_2$, both of which consist of
predicate queries,

$$\mathrm{SVDB}(W_1 \wedge W_2) = \mathrm{SVDB}(W_1)\,\mathrm{SVDB}(W_2),$$

$$\sqrt{\mathrm{SVDB}(W_1 \vee W_2)} - \sqrt{m_1 m_2 n_1 n_2} \le \sqrt{\mathrm{SVDB}(\neg W_1)}\,\sqrt{\mathrm{SVDB}(\neg W_2)}.$$

7 Applications and Discussion

In this section, we consider some practical applications of the SVD bound and also investigate its relationship
to the concept of VC-dimension.

7.1 Efficient Computation of the SVD Bound

To use the SVD bound to explore workload error complexity for practical workloads we would like to compute
it efficiently. For an m x n workload W, computing SVDB(W) directly from W takes $O(mn^2)$ time. For
some workloads, the n x n matrix $W^T W$ can be computed directly without materializing W. In this case we
can compute the SVD bound directly from $W^T W$ in $O(n^3)$ time by applying Prop. 4.1, greatly improving
the efficiency if $m \gg n$. For example, $W^T W$ can easily be computed directly for the AllRange(n) and
AllPredicate(n) workloads. Computational cost is improved from $O(n^4)$ to $O(n^3)$ for AllRange and
from $O(n^2 2^n)$ to $O(n^3)$ for AllPredicate.

Further, it is often the case that workloads of multi-dimensional queries are formed by defining a set
of queries of interest on individual dimensions and combining them by conjunction. For example, the
workload of all k dimensional range queries can be constructed by combining one-dimensional range queries
on each dimension by conjunction. In this case, the overall number of cell conditions may be large, but the
computation of the SVD bound can be performed on a per-attribute basis by applying the crossproduct rule
of Thm. 6.3. For example, for the workload of all k dimensional range queries with size n on each dimension,
using the crossproduct rule improves the computation cost from $O(n^{3k})$ to $O(n^3 + k)$.
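
The per-attribute computation can be sketched as follows, assuming the closed form for $W^T W$ of one-dimensional range queries (entry (i, j) counts the ranges covering both cells) and treating a multi-dimensional range workload as the crossproduct of its one-dimensional factors. Function names are hypothetical.

```python
import numpy as np

def allrange_gram(n):
    """W^T W for all 1-D range queries over n cells, without building W."""
    idx = np.arange(n)
    return (np.minimum.outer(idx, idx) + 1.0) * (n - np.maximum.outer(idx, idx))

def svdb_from_gram(G):
    d = np.clip(np.linalg.eigvalsh(G), 0.0, None)
    return np.sqrt(d).sum() ** 2 / G.shape[0]

def svdb_multidim_range(sizes):
    """SVD bound of a multi-dimensional range workload: one small
    eigenvalue problem per dimension, multiplied together (Thm. 6.3)."""
    out = 1.0
    for n in sizes:
        out *= svdb_from_gram(allrange_gram(n))
    return out

G8 = allrange_gram(8)
direct = svdb_from_gram(np.kron(G8, G8))      # 64 x 64 eigenproblem
per_attribute = svdb_multidim_range([8, 8])   # two 8 x 8 eigenproblems
ten_binary = svdb_multidim_range([2] * 10)    # AllRange(2, ..., 2), ten dimensions
print(direct, per_attribute, ten_binary)
```

The direct and per-attribute results agree, and the ten-dimensional binary case comes out near 5.24 x 10^5, in line with the leading digits reported for that workload in Table 2.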
7.2 The Benefit of Workload Specialization and the Cost of Collusion Protection

When using a differentially private server, an analyst must decide to what degree to specialize their workload. 
For fixed privacy parameters, submitting a more specific workload may allow better error rates but may limit 
the breadth of tasks that can be carried out. A related issue arises when multiple analysts submit workloads. 
Suppose for simplicity that each analyst has an equivalent privacy budget. If there is a risk of collusion, 
then a single synthetic data set must be generated to support the union of their workloads, using only the 
privacy budget of an individual analyst. We would like to understand the additional cost to accuracy that 
results from protecting against collusion in this manner. One measure of this cost is the difference between 
the workload complexity of an analyst's workload alone, and the workload complexity of the union of the 
workloads. 

Existing techniques do not provide a principled way to address these questions, but the SVD bound 
can help. In the following example we present some preliminary empirical findings concerning randomly 
generated workloads. 

Example 7.1. Assume ε = 0.5, δ = 0.0001, and form a workload by randomly selecting 0.1% of the queries
from AllRange(1024). The SVD bound on these random workloads is quite stable and suggests that we can 
expect to achieve average per query error of about 1045. But we can simultaneously answer all range queries 
with average per query error of just 1394. Thus the benefit of specializing to the small workload is modest. 
If we sample 1% of the queries from AllRange(1024) they have per query error around 1371, showing the 
benefit of specialization is negligible. 

Further, if we imagine two analysts, each choosing randomly a sample consisting of 1% of the queries
from AllRange(1024), we find that the bound on error of answering the individual workloads differs by less than
2% from the bound for answering the combined union of the workloads, indicating that the cost of collusion
protection is low.
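
An experiment of this kind can be set up along the following lines. This is a scaled-down sketch, not the configuration of the example: it uses n = 64 cells and 10% samples so that it runs quickly, and it reports the per-query bound in units of $P(\epsilon, \delta)$, so its numbers are not comparable with those quoted above.

```python
import numpy as np

def svdb(W):
    return np.linalg.svd(W, compute_uv=False).sum() ** 2 / W.shape[1]

def all_range(n):
    """Explicit matrix of all 1-D range queries over n cells."""
    rows = []
    for lo in range(n):
        for hi in range(lo, n):
            q = np.zeros(n)
            q[lo:hi + 1] = 1.0
            rows.append(q)
    return np.array(rows)

rng = np.random.default_rng(7)
n = 64
W_all = all_range(n)
m = W_all.shape[0]

# Two analysts, each sampling 10% of the range queries without replacement.
analyst_a = W_all[rng.choice(m, size=m // 10, replace=False)]
analyst_b = W_all[rng.choice(m, size=m // 10, replace=False)]
union = np.unique(np.vstack([analyst_a, analyst_b]), axis=0)

workloads = {"analyst A": analyst_a, "analyst B": analyst_b,
             "union": union, "all ranges": W_all}
per_query = {name: svdb(W) / W.shape[0] for name, W in workloads.items()}
print(per_query)
```

Comparing the per-query values for a single analyst, for the union, and for the full workload gives a rough picture of the specialization benefit and the collusion cost at this smaller scale.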

These results are somewhat surprising, but we expect that different properties would emerge for more 
realistic non-random workloads. Further theoretical investigation of workload specialization is an interesting 
direction for future work. 

7.3 Connection to Learning Theory 

Techniques from learning theory have been used to illuminate the hardness of query workloads. In particular, 
Blum et al. [3] showed that for any set of predicate queries, it is possible to release a differentially private
synthetic database with error bounds that grow with the size of the database and Vapnik-Chervonenkis (VC) 
dimension [21] of the workload. 

Unfortunately we cannot directly compare our error bounds with those arising from their techniques. 
First, they consider standard differential privacy while we consider approximate differential privacy. Second, 
their results only provide an upper bound on error for a workload, not a lower bound. And third, their 
results consider only sets of predicate queries while we are interested in all linear queries. 

Nevertheless, we show below an interesting connection between the SVD bound and VC dimension, and 
that, at least for some common workloads, the SVD bound appears to provide a more precise yardstick for 
understanding workload error complexity. 

Background on VC Dimension 

The VC dimension measures the complexity of a set of functions on a space. It can also be interpreted with 
respect to workloads consisting of predicate queries. Here we review the basic definition and represent it in 
vector form in order to relate it to the SVD bound. To define the VC dimension, we need to first describe 
the definition of a range space and shattering. 



Definition 7.1 (Range Space). A range space is a pair (X, R) where X is a (finite or infinite) set and R
is a (finite or infinite) set of subsets of X.

Definition 7.2 (Shattering). Given a range space (X, R) and a set $A \subseteq X$, A is shattered by R if
$|\{r \cap A \mid r \in R\}| = 2^{|A|}$.

The VC dimension of a range space is defined to be the size of the largest shattered set.

Definition 7.3 (VC Dimension). Let S = (X, R) be a range space. The Vapnik-Chervonenkis dimension of
S, denoted VC(S), is the cardinality of the largest subset of X that is shattered by R. If there are arbitrarily
large shattered sets, VC(S) = ∞.

VC dimension of workload matrices 

When applied to predicate queries, X corresponds to the set of all cell conditions and R is a set of subsets of
X. Notice that each subset of X can be represented as a predicate query, and that the intersection between
two sets is equivalent to the entry-wise "AND" operation between two queries (denoted by &). Therefore the
shattered set, as well as the VC dimension, can be defined in terms of query vectors and workload matrices.

Definition 7.4 (Workload Shattering). Given a workload of predicate queries W and a predicate query q,
q is shattered by W if $|\{q \,\&\, w \mid w \in W\}| = 2^{\|q\|_1}$.

Definition 7.5 (Workload VC dimension). Given a workload W, its VC dimension, denoted VC(W), is
the maximum sum of entries of the queries that are shattered by W.

The VC dimension of a workload has an elegant alternative formulation related to projection and submatrices
of all predicate queries. In Def. 7.4, if a predicate query q is shattered by W, the set $\{w \,\&\, q \mid w \in W\}$
contains $2^{\|q\|_1}$ different queries, which is the number of all predicate queries defined on the cells with $q_i = 1$.
Therefore, let $\mu_q$ be the projection onto all the cells with $q_i = 1$. Then q is shattered by W if and only if
$\mu_q(W)$ is the workload consisting of all predicate queries on the projected cells, which leads to the following
theorem:

Theorem 7.1 (VC Dimension using projection).
Given a workload W, the VC dimension of W is the largest number k such that there exists a projection μ
such that μ(W) = AllPredicate(k).
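
The projection view translates into a direct, if exponential, procedure for small predicate workloads: a set of k cells is shattered exactly when the projected rows take all $2^k$ distinct 0/1 patterns. The function below is a hypothetical brute-force sketch along those lines.

```python
import numpy as np
from itertools import combinations, product

def vc_dimension(W):
    """Largest k such that some k columns of the 0/1 workload W, taken
    together, exhibit all 2^k distinct row patterns (Theorem 7.1)."""
    W = np.asarray(W).astype(int)
    n = W.shape[1]
    best = 0
    for k in range(1, n + 1):
        shattered = any(
            len({tuple(row) for row in W[:, list(cols)]}) == 2 ** k
            for cols in combinations(range(n), k)
        )
        if not shattered:
            break  # subsets of shattered sets are shattered, so no larger k works
        best = k
    return best

all_predicate_3 = np.array(list(product([0, 1], repeat=3)))
identity_3 = np.eye(3, dtype=int)
print(vc_dimension(all_predicate_3), vc_dimension(identity_3))
```

The early exit relies on shattering being closed under taking subsets; the cost still grows exponentially with the number of cells, so this is only practical for toy examples.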

Based on the Blum et al. [3] algorithm for releasing a synthetic database designed for a given linear 
workload, one can derive an upper bound for the error of the workload using its VC dimension, as shown by 
Kasiviswanathan et al. [14]: 

Proposition 7.1 ([3, 14]). Given an m x n workload W over a database with size d, the Blum et al.
mechanism gives a synthetic database that answers the queries in W under ε-differential privacy with total
error $\tilde{O}\big(m\,(m\,d^2\,\mathrm{VC}(W)/\epsilon)^{2/3}\big)$.

Though the upper bound cannot be compared with the SVD bound directly, for fixed database and e, 
there are workloads with tremendously different SVD bound that have the same number of queries and VC 
dimension, as shown in the following example. 

Example 7.2. Given a database on 2048 cells, let workload $W_1$ be the cumulative distribution (CDF)
workload over the cells. Let workload $W_2$ contain queries on each individual cell of the first 2047 cells, along
with the query summing all cells. Both workloads have 2048 queries and the same VC dimension. But
the SVD bound illustrates the difference between their complexity: $\overline{\mathrm{svdb}}(W_1) > 20059$ and
$\overline{\mathrm{svdb}}(W_2) = 2136$. In addition, $W_2$ is a variable-agnostic workload on the first 2047 cells, so the supreme SVD bound
on $W_2$ is tight. Therefore the achievable errors of these two workloads are guaranteed to differ by a factor of
roughly 10.
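The value quoted for $W_2$ can be reproduced from the closed form for variable-agnostic workloads stated in Theorem 5.1. The sketch below assumes that, restricted to its first 2047 cells, $W_2$ consists of one query per cell plus the total query, so that $W^{T}W$ has $a = 2$ on the diagonal and $b = 1$ elsewhere; the function name is illustrative.

```python
import math


def svdb_variable_agnostic(a, b, n):
    """Closed form (1/n) * (sqrt(a + (n-1)b) + (n-1) * sqrt(a - b))^2."""
    return (math.sqrt(a + (n - 1) * b) + (n - 1) * math.sqrt(a - b)) ** 2 / n


# W2 on its first 2047 cells: a = 2 (cell query + total query), b = 1 (total query only).
bound_w2 = svdb_variable_agnostic(a=2, b=1, n=2047)
ratio = 20059 / bound_w2  # 20059 is the bound quoted for the CDF workload W1

print(round(bound_w2, 2))  # 2136.47
print(ratio)               # a little under 10
```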

Therefore, in this case, the SVD bound appears to offer a more reliable measure of workload complexity 
than one based on VC dimension. 
8 Related Work



The original description of the matrix mechanism [15] focuses primarily on $\epsilon$-differential privacy, with a
brief consideration of $(\epsilon, \delta)$-differential privacy. A number of proposed mechanisms can be formulated as
instances of the matrix mechanism. Techniques for accurately answering range queries are presented in
[22, 13]. Low-order marginals are studied in [1] using a Fourier transformation as the strategy (combined
with other techniques for achieving integral consistency). An algorithm for generating good strategies for
answering sets of data cube queries is introduced in [5].

As mentioned above, Blum et al. [3] describe a very general mechanism for synthetic data release, in 
which error rates are related to the VC dimension of the workload. Unfortunately it cannot be implemented 
efficiently in general as it requires sampling from a probability distribution defined over all possible databases. 
Hardt et al. [12] present a lower bound on error for workloads of predicate queries. It is a lower bound on all
data-independent mechanisms and therefore is more general than the SVD bound. Under $(\epsilon, \delta)$-differential
privacy, this bound is proportional to the geometric average of the eigenvalues of the workload. The fact
that the SVD bound is associated with the algebraic average of the eigenvalues suggests the SVD bound
may be a tighter bound. However, the two bounds are not directly comparable since the Hardt et al. bound
is measured by the expected L2 distance. An important open question is whether the possible gap between
these bounds results from a limitation in the expressiveness of the matrix mechanism, looseness in the bound
by Hardt et al., or both.

Analyzing the error of data-dependent mechanisms is fundamentally different, and our bounds do not
apply. Lower bounds on answering all $k$-way marginals with a data-dependent mechanism are discussed
in [14]. A data-dependent approach for range queries is described in [4]. Data-dependent mechanisms for
predicate queries are presented in [20, 11]. Dwork et al. provide a general error bound using an arbitrary
differentially private mechanism [9] but not specifically for linear counting queries.

9 Conclusion 

We have shown that, for a general class of $(\epsilon, \delta)$-differentially private algorithms, the error rate achievable for
a set of queries is determined by the spectral properties of the queries when they are represented in matrix 
form. The result is a lower bound on error which is a simple function of the eigenvalues of the query matrix. 
The bound can be used to assess the quality of a number of existing differentially private algorithms, to 
compare the hardness of query sets, and to guide users in the design of query workloads. 
A Proofs



The following appendices contain proofs that are omitted from the body of the paper. In those proofs, we
say $A$ is a strategy on workload $W$ if $WA^{+}A = W$.

A.1 The Extended Matrix Mechanism

In this section we present the proofs related to the extended matrix mechanism. 

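Proposition 3.1 (Extended $(\epsilon, \delta)$-Matrix Mechanism). Given an $m \times n$ query matrix $W$, and a $p \times n$
strategy matrix $A$ such that $WA^{+}A = W$, the following randomized algorithm $\mathcal{M}_A$ is $(\epsilon, \delta)$-differentially
private:

$$\mathcal{M}_A(W, x) = Wx + WA^{+}\,\mathrm{Normal}(\sigma)^{p},$$

where $\sigma = \Delta_A\sqrt{2\ln(2/\delta)}/\epsilon$.

Proof. Noticing that $WA^{+}A = W$,

$$\mathcal{M}_A(W, x) = Wx + WA^{+}\,\mathrm{Normal}(\sigma)^{p} = WA^{+}\left(Ax + \mathrm{Normal}(\sigma)^{p}\right).$$

Since $\sigma = \Delta_A\sqrt{2\ln(2/\delta)}/\epsilon$, according to Proposition 2.1, answering $Ax$ with $Ax + \mathrm{Normal}(\sigma)^{p}$ satisfies
$(\epsilon, \delta)$-differential privacy. Because $\mathcal{M}_A(W, x)$ only post-processes that noisy answer, it satisfies
$(\epsilon, \delta)$-differential privacy as well. □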
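The rewriting used in the proof, $WA^{+}(Ax + \text{noise})$, translates directly into code. The following is a minimal sketch, not the authors' implementation: it assumes the caller supplies the pseudo-inverse `A_pinv`, and it takes $\Delta_A$ to be the largest L2 column norm of $A$ (an assumption here, consistent with the use of the maximal diagonal entry of $A^{T}A$ in Lemma 4.1). All function names are hypothetical.

```python
import math
import random


def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def l2_sensitivity(A):
    """Largest L2 column norm of the strategy matrix A (assumed definition of Delta_A)."""
    return max(math.sqrt(sum(row[j] ** 2 for row in A)) for j in range(len(A[0])))


def noise_scale(A, eps, delta):
    """sigma = Delta_A * sqrt(2 ln(2/delta)) / eps."""
    return l2_sensitivity(A) * math.sqrt(2 * math.log(2 / delta)) / eps


def matrix_mechanism(W, A, A_pinv, x, eps, delta, rng=random):
    """Return W A^+ (A x + Gaussian noise); A_pinv must be the pseudo-inverse of A."""
    sigma = noise_scale(A, eps, delta)
    noisy_strategy_answers = [y + rng.gauss(0.0, sigma) for y in matvec(A, x)]
    return matvec(W, matvec(A_pinv, noisy_strategy_answers))


# Identity strategy on two cells; its pseudo-inverse is again the identity.
W = [[1, 1], [1, 0]]
I2 = [[1, 0], [0, 1]]
print(matrix_mechanism(W, I2, I2, [3, 5], eps=1.0, delta=0.5, rng=random.Random(0)))
```

Passing the random source as `rng` keeps the sketch testable: any object with a `gauss(mu, sigma)` method can stand in for the noise generator.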

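Proposition 3.2 (Total Error). Given a workload $W$, the total error of answering $W$ using the extended
$(\epsilon, \delta)$ matrix mechanism with query strategy $A$ is:

$$\mathrm{Error}_A(W) = P(\epsilon, \delta)\,\Delta_A^{2}\,\|WA^{+}\|_F^{2}, \qquad (6)$$

where $P(\epsilon, \delta) = 2\ln(2/\delta)/\epsilon^{2}$.

Proof. According to Def. 3.1 and Prop. 3.1, given a query $w$ and a strategy $A$, the mean squared error of the
noisy answer $\hat{w}$ is

$$\mathrm{Error}_A(w) = E\big((wx - \hat{w})^{2}\big) = \mathrm{Var}(\hat{w}) = \mathrm{Var}\big(wA^{+}\,\mathrm{Normal}(\sigma)^{p}\big) = \sigma^{2}\|wA^{+}\|_2^{2} = P(\epsilon, \delta)\,\Delta_A^{2}\,\|wA^{+}\|_2^{2}.$$

Therefore for a given workload $W$,

$$\mathrm{Error}_A(W) = \sum_{w \in W} P(\epsilon, \delta)\,\Delta_A^{2}\,\|wA^{+}\|_2^{2} = P(\epsilon, \delta)\,\Delta_A^{2}\,\|WA^{+}\|_F^{2}.$$

□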
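Equation (6) can be evaluated directly for small matrices. The sketch below again assumes the pseudo-inverse is supplied by the caller and that $\Delta_A$ is the largest L2 column norm of $A$; `total_error` is an illustrative name.

```python
import math


def total_error(W, A, A_pinv, eps, delta):
    """P(eps, delta) * Delta_A^2 * ||W A^+||_F^2, with A_pinv supplied by the caller."""
    p = 2 * math.log(2 / delta) / eps ** 2
    # Squared L2 sensitivity: largest squared column norm of A (assumed definition).
    delta_a_sq = max(sum(row[j] ** 2 for row in A) for j in range(len(A[0])))
    # Squared Frobenius norm of W A^+, accumulated entry by entry.
    frob_sq = 0.0
    for w in W:
        for k in range(len(A_pinv[0])):
            entry = sum(w[j] * A_pinv[j][k] for j in range(len(w)))
            frob_sq += entry ** 2
    return p * delta_a_sq * frob_sq


W = [[1, 1], [1, 0], [0, 1]]
identity, identity_pinv = [[1, 0], [0, 1]], [[1, 0], [0, 1]]
scaled, scaled_pinv = [[2, 0], [0, 2]], [[0.5, 0], [0, 0.5]]

err_identity = total_error(W, identity, identity_pinv, eps=1.0, delta=0.5)
err_scaled = total_error(W, scaled, scaled_pinv, eps=1.0, delta=0.5)
print(err_identity, err_scaled)
```

In this instance the two strategies give the same error: doubling $A$ multiplies $\Delta_A^{2}$ by 4 and divides $\|WA^{+}\|_F^{2}$ by 4.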

A.2 Workload Containment and Equivalence

This section contains the proofs of the relationship between workloads and their minimal errors in Section 3.3. 

Proposition 3.3 (Equivalence Conditions). Given an $m_1 \times n_1$ workload $W_1$ and an $m_2 \times n_2$ workload $W_2$,
the following conditions imply that $\mathrm{MinError}(W_1) = \mathrm{MinError}(W_2)$:

(i) $W_1^{T}W_1 = W_2^{T}W_2$.

(ii) $W_1 = QW_2$ for some orthogonal matrix $Q$.

(iii) $W_2$ results from permuting the rows of $W_1$.

(iv) $W_2$ results from permuting the columns of $W_1$.

(v) $W_2$ results from the column projection of $W_1$ on all of its nonzero columns.

Proof. (i): If $W_1^{T}W_1 = W_2^{T}W_2$, then for any strategy $A$,

$$\|W_1A^{+}\|_F^{2} = \mathrm{trace}\big((A^{+})^{T}W_1^{T}W_1A^{+}\big) = \mathrm{trace}\big((A^{+})^{T}W_2^{T}W_2A^{+}\big) = \|W_2A^{+}\|_F^{2}.$$

Therefore $\mathrm{MinError}(W_1) = \mathrm{MinError}(W_2)$.

(ii): This reduces to (i), since $W_1^{T}W_1 = W_2^{T}Q^{T}QW_2 = W_2^{T}W_2$.

(iii): It is a special case of (ii) where $Q$ is a permutation matrix.

(iv): Let $P$ be the permutation matrix such that $W_1P = W_2$. For any strategy $A$ on $W_1$, $AP$ is a strategy
on $W_2$ and $\mathrm{Error}_A(W_1) = \mathrm{Error}_{AP}(W_2)$.

(v): Since (iv) is true, we can assume $W_1 = [W_2, 0]$. For any strategy matrix $A_2$ on $W_2$, $A_1 = [A_2, 0]$ is a
strategy on $W_1$ and

$$\|W_1A_1^{+}\|_F^{2} = \mathrm{trace}\big(A_1^{+}(A_1^{+})^{T}W_1^{T}W_1\big)
= \mathrm{trace}\big([(A_2^{+})^{T}, 0]^{T}[(A_2^{+})^{T}, 0]\,[W_2, 0]^{T}[W_2, 0]\big)
= \mathrm{trace}\big((A_2^{+})^{T}W_2^{T}W_2A_2^{+}\big) = \|W_2A_2^{+}\|_F^{2}.$$

Conversely, for any strategy $A_1$ on $W_1$ there is a strategy on $W_2$ with equal or smaller error, formed by deleting
the corresponding columns from $A_1$. □

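Proposition 3.4 (Error inequality). Given an $m_1 \times n_1$ workload $W_1$ and an $m_2 \times n_2$ workload $W_2$, the
following conditions imply that $\mathrm{MinError}(W_1) \le \mathrm{MinError}(W_2)$:

(i) $W_1 \subseteq W_2$.

(ii) $W_1$ is a column projection of $W_2$.

Proof. For (i), let $W_2'$ be a workload equivalent to $W_2$ that contains all rows of $W_1$. According to Prop. 3.3 (iii), we can
assume $W_2' = \begin{bmatrix} W_1 \\ W_3 \end{bmatrix}$. For any strategy $A$ on $W_2$, since $A$ is also a strategy on $W_2'$, $A$ can represent all
queries in $W_1$ as well. Thus $A$ is a strategy on $W_1$. In addition,

$$\begin{aligned}
\mathrm{Error}_A(W_2) &= P(\epsilon, \delta)\,\Delta_A^{2}\,\|W_2A^{+}\|_F^{2} \\
&= P(\epsilon, \delta)\,\Delta_A^{2}\,\|W_2'A^{+}\|_F^{2} \\
&= P(\epsilon, \delta)\,\Delta_A^{2}\,\big(\|W_1A^{+}\|_F^{2} + \|W_3A^{+}\|_F^{2}\big) \\
&= \mathrm{Error}_A(W_1) + \mathrm{Error}_A(W_3) \ge \mathrm{Error}_A(W_1).
\end{aligned}$$

Therefore, $\mathrm{MinError}(W_1) \le \mathrm{MinError}(W_2)$.

For (ii), given a strategy $A_2$ on $W_2$, let $A_1$ be a column projection of $A_2$ using the same projection
that generates $W_1$ from $W_2$. According to the construction of $A_1$ and $W_1$, since $W_2 = W_2A_2^{+}A_2$, we
have $W_2A_2^{+}A_1 = W_1$. Therefore, according to Theorem 2.1, $\|W_2A_2^{+}\|_F \ge \|W_1A_1^{+}\|_F$. Furthermore, since
$\Delta_{A_2} \ge \Delta_{A_1}$, we know $\mathrm{Error}_{A_2}(W_2) \ge \mathrm{Error}_{A_1}(W_1)$. □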

A.3 The Singular Value Bound

We first present the proofs of the lemmas that are required for the proof of Thm. 4.1. After that, we prove the
theorems showing that the supreme SVD bound satisfies the error properties.

Lemma 4.1. Given a workload $W$, there exists a strategy $A \in \mathcal{A}_W$ such that $\mathrm{Error}_A(W) = \mathrm{MinError}(W)$.

Proof. For any workload $W$, the problem of finding a strategy that minimizes the total error of $W$ can
be formulated as an SDP problem [15]. Therefore the optimal strategy that minimizes the total error of
$W$ always exists. Let $A'$ be an optimal strategy on workload $W$. We now construct a matrix $A$ from $A'$
such that $A \in \mathcal{A}_W$. Let $d_1, \ldots, d_n$ denote the diagonal entries of the matrix $\Delta_{A'}^{2}I - A'^{T}A'$, i.e. $d_1, \ldots, d_n$
are the differences between the maximal diagonal entry of $A'^{T}A'$ and each diagonal entry of $A'^{T}A'$. Since
$d_1, \ldots, d_n \ge 0$, let $D$ be the diagonal matrix whose diagonal entries are $\sqrt{d_1}, \ldots, \sqrt{d_n}$ and $A = \begin{bmatrix} A' \\ D \end{bmatrix}$. Then
$A$ is a strategy matrix such that the diagonal entries of $A^{T}A$ are all the same. Let $B = [A'^{+}, 0]$. Then
$WBA = WA'^{+}A' = W$. According to Theorem 2.1, $\|WA^{+}\|_F \le \|WB\|_F$. Recalling that $\Delta_A = \Delta_{A'}$, we have

$$\begin{aligned}
\mathrm{Error}_A(W) &= P(\epsilon, \delta)\,\Delta_A^{2}\,\|WA^{+}\|_F^{2} \\
&\le P(\epsilon, \delta)\,\Delta_A^{2}\,\|WB\|_F^{2} \\
&= P(\epsilon, \delta)\,\Delta_{A'}^{2}\,\|WA'^{+}\|_F^{2} \\
&= \mathrm{Error}_{A'}(W) = \mathrm{MinError}(W).
\end{aligned}$$

Therefore $\mathrm{Error}_A(W) = \mathrm{MinError}(W)$ and $A$ is an optimal strategy for workload $W$. □

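Lemma 4.2. Let $D$ be a diagonal matrix with non-negative diagonal entries and $P$ be an orthogonal matrix
whose columns are $p_1, p_2, \ldots, p_n$. Then

$$\mathrm{trace}(D) \le \sum_{i=1}^{n}\|Dp_i\|_2.$$

Proof. Use $d_j$ to denote the diagonal entries of $D$ and $p_{ji}$ to denote the $j$-th entry of the column $p_i$. Noticing that
$\sum_{j=1}^{n}p_{ji}^{2} = 1$, by the concavity of the square root we have

$$\|Dp_i\|_2 = \sqrt{\sum_{j=1}^{n}p_{ji}^{2}\,d_j^{2}} \ge \sum_{j=1}^{n}p_{ji}^{2}\,d_j.$$

Therefore, since $\sum_{i=1}^{n}p_{ji}^{2} = 1$,

$$\sum_{i=1}^{n}\|Dp_i\|_2 \ge \sum_{i=1}^{n}\sum_{j=1}^{n}p_{ji}^{2}\,d_j = \sum_{j=1}^{n}\Big(\sum_{i=1}^{n}p_{ji}^{2}\Big)d_j = \mathrm{trace}(D).$$

□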
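As a quick numerical sanity check of Lemma 4.2 (an illustration only, not part of the proof), the sketch below takes $P$ to be a $2 \times 2$ rotation, which is one particular family of orthogonal matrices, and reports the gap $\sum_i \|Dp_i\|_2 - \mathrm{trace}(D)$. The function name is hypothetical.

```python
import math


def lemma_4_2_gap(d, theta):
    """Sum_i ||D p_i||_2 - trace(D) for D = diag(d) and P a 2x2 rotation by theta."""
    c, s = math.cos(theta), math.sin(theta)
    columns = [(c, s), (-s, c)]  # columns p_1, p_2 of the rotation matrix
    lhs = sum(math.hypot(d[0] * p[0], d[1] * p[1]) for p in columns)
    return lhs - (d[0] + d[1])


for theta in (0.0, 0.3, math.pi / 4, 1.2):
    print(theta, lemma_4_2_gap((3.0, 1.0), theta))
```

The gap is zero at `theta = 0`, where $P$ is the identity, and non-negative at the other angles.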

Theorem 4.2. Given an $m_1 \times n_1$ workload $W_1$ and an $m_2 \times n_2$ workload $W_2$, the following conditions
imply that $\overline{\mathrm{svdb}}(W_1) = \overline{\mathrm{svdb}}(W_2)$:

(i) $W_1^{T}W_1 = W_2^{T}W_2$.

(ii) $W_1 = QW_2$ for some orthogonal matrix $Q$.

(iii) $W_2$ results from permuting the rows of $W_1$.

(iv) $W_2$ results from permuting the columns of $W_1$.

(v) $W_2$ results from column projection of $W_1$ on all of its nonzero columns.

Proof. (i), (ii), (iii): Since any one of those conditions leads to $W_1^{T}W_1 = W_2^{T}W_2$, according to Prop. 4.1,
$\overline{\mathrm{svdb}}(W_1) = \overline{\mathrm{svdb}}(W_2)$.

(iv): Given a workload $W_1$, it is sufficient to prove that the singular values of $W_1$ are independent of the column
representation. Let $W_2$ be a matrix resulting from a permutation of the columns of $W_1$ and $P$ be the
permutation matrix such that $W_1P = W_2$. If a singular value decomposition of $W_1$ is $W_1 = Q_W\Lambda_WP_W$,
then $W_2 = Q_W\Lambda_WP_WP$. Since $P_WP$ is still an orthogonal matrix, $W_2 = Q_W\Lambda_W(P_WP)$ is a singular value
decomposition of $W_2$. Therefore the singular values of $W_2$ are exactly the same as the singular values of $W_1$.

(v): Since $W_2$ is a column projection of $W_1$, $\overline{\mathrm{svdb}}(W_1) \ge \overline{\mathrm{svdb}}(W_2)$ by definition. In addition, for
any matrix with columns of zeroes, removing these columns will not impact the nonzero singular values of
the matrix. Therefore projecting those columns out will reduce the total number of singular values but not
their sum. Hence projecting out all zero columns from $W_1$ will not decrease the SVD bound, which indicates
$\overline{\mathrm{svdb}}(W_1) = \overline{\mathrm{svdb}}(W_2)$. □

Theorem 4.3. Given an $m_1 \times n_1$ workload $W_1$ and an $m_2 \times n_2$ workload $W_2$, the following conditions
imply that $\overline{\mathrm{svdb}}(W_1) \le \overline{\mathrm{svdb}}(W_2)$:

(i) $W_1 \subseteq W_2$.

(ii) $W_1$ is a column projection of $W_2$.

Since (ii) is naturally satisfied according to the definition of $\overline{\mathrm{svdb}}(W)$, it is sufficient to prove (i). Here
we prove it is true even for the SVD bound so that it is also true for the supreme SVD bound.

Proof. Given $W_1 \subseteq W_2$, according to the definition, there exists a workload $W_2'$ that is equivalent to $W_2$ and
contains all the queries of $W_1$. Then $W_2'$ has the following form:

$$W_2' = \begin{bmatrix} W_1 \\ W_3 \end{bmatrix}.$$

Then

$$W_2'^{T}W_2' - W_1^{T}W_1 = W_3^{T}W_3 \succeq 0.$$

Let $W_1 = Q_1\Lambda_1P_1$ and $W_2' = Q_2\Lambda_2P_2$ be the singular value decompositions of $W_1$ and $W_2'$, respectively.
Then

$$\begin{aligned}
W_2'^{T}W_2' \succeq W_1^{T}W_1
&\;\Rightarrow\; P_2^{T}\Lambda_2^{2}P_2 \succeq P_1^{T}\Lambda_1^{2}P_1 \\
&\;\Rightarrow\; \Lambda_2^{2} \succeq P_2P_1^{T}\Lambda_1^{2}P_1P_2^{T} \\
&\;\Rightarrow\; \forall i,\ \lambda_i \ge \|\Lambda_1p_i\|_2,
\end{aligned}$$

where $\lambda_i$ is the $i$-th diagonal entry of $\Lambda_2$ and $p_i$ is the $i$-th column vector of $P_1P_2^{T}$. The inequality in the
last row is based on the property that the diagonal entries of any positive semidefinite matrix are non-negative.
Therefore, according to Lemma 4.2,

$$\mathrm{trace}(\Lambda_2) = \sum_{i=1}^{n}\lambda_i \ge \sum_{i=1}^{n}\|\Lambda_1p_i\|_2 \ge \mathrm{trace}(\Lambda_1). \qquad (7)$$

□



A.4 Analysis of the SVD Bound

Two theorems that are respectively related to the tightness and the looseness of the SVD bound are proved 
here. 

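Theorem 5.1. For a positive integer $k$ and $n = 2^{k}$, let $W$ be any $m \times n$ variable-agnostic workload with
singular value decomposition $W = Q_W\Lambda_WP_W$. The strategy $A = \sqrt{\Lambda_W}P_W$ attains $\mathrm{MinError}(W)$ and

$$\mathrm{MinError}(W) = P(\epsilon, \delta)\,\mathrm{svdb}(W) = P(\epsilon, \delta)\,\frac{1}{n}\Big(\sqrt{a + (n-1)b} + (n-1)\sqrt{a - b}\Big)^{2}.$$

Proof. For the strategy $A = \sqrt{\Lambda_W}P_W$, if $A \in \mathcal{A}_W$ then
$\mathrm{Error}_A(W) = P(\epsilon, \delta)\,\mathrm{svdb}(W)$. Since $W$ is a variable-agnostic matrix, $W^{T}W$ has the following special
form:

$$W^{T}W = \begin{bmatrix} a & b & \cdots & b \\ b & a & \cdots & b \\ \vdots & \vdots & \ddots & \vdots \\ b & b & \cdots & a \end{bmatrix},$$

where $a > b$. One can verify that $a + (n-1)b$ is an eigenvalue of $W^{T}W$ with multiplicity 1 and $a - b$ is an
eigenvalue of $W^{T}W$ with multiplicity $n - 1$. Let $P_W = (p_{ij})_{n \times n}$. Noticing the rows of $P_W$ are the eigenvectors of
$W^{T}W$, without loss of generality, assume its first row contains the eigenvector associated with eigenvalue
$a + (n-1)b$. In addition, since $W^{T}W\mathbf{1} = (a + (n-1)b)\mathbf{1}$, $(1/\sqrt{n})\,\mathbf{1}$ is a unit-length eigenvector
of $W^{T}W$ associated with eigenvalue $a + (n-1)b$, so $p_{1i}^{2} = 1/n$ for every $i$. The $i$-th diagonal entry of the matrix
$A^{T}A = P_W^{T}\Lambda_WP_W$ can be computed as:

$$\sqrt{a + (n-1)b}\;p_{1i}^{2} + \sum_{j=2}^{n}\sqrt{a - b}\;p_{ji}^{2} = \frac{1}{n}\sqrt{a + (n-1)b} + \frac{n-1}{n}\sqrt{a - b},$$

which does not depend on $i$. Thus $A \in \mathcal{A}_W$ and so it is a strategy for $W$ such that
$\mathrm{Error}_A(W) = P(\epsilon, \delta)\,\mathrm{svdb}(W)$. Therefore the SVD bound is achieved. □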
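The eigen-structure used in this proof is easy to confirm on a small instance. The sketch below builds the $n \times n$ matrix with $a$ on the diagonal and $b$ elsewhere, checks the two eigenvalues against explicit eigenvectors, and then compares the product of the common diagonal entry of $A^{T}A$ with the sum of singular values against the closed form of the theorem. The values $a = 2$, $b = 1$, $n = 4$ and the helper names are arbitrary choices made for illustration.

```python
import math


def variable_agnostic_gram(a, b, n):
    """n x n matrix with a on the diagonal and b elsewhere (the form of W^T W)."""
    return [[a if i == j else b for j in range(n)] for i in range(n)]


def matvec(M, v):
    return [sum(x * y for x, y in zip(row, v)) for row in M]


a, b, n = 2, 1, 4
G = variable_agnostic_gram(a, b, n)

# The all-ones vector has eigenvalue a + (n-1)b.
ones = [1] * n
top_ok = matvec(G, ones) == [a + (n - 1) * b] * n

# Each vector e_1 - e_j (j > 1) has eigenvalue a - b.
rest_ok = True
for j in range(1, n):
    v = [0] * n
    v[0], v[j] = 1, -1
    rest_ok = rest_ok and matvec(G, v) == [(a - b) * x for x in v]

top, rest = math.sqrt(a + (n - 1) * b), math.sqrt(a - b)
sing_sum = top + (n - 1) * rest           # sum of the singular values of W
diag_AtA = top / n + (n - 1) * rest / n   # common diagonal entry of A^T A
svdb = sing_sum ** 2 / n                  # closed form from the theorem
print(top_ok, rest_ok, diag_AtA * sing_sum, svdb)
```

The last two printed numbers agree, which is the identity $\Delta_A^{2}\,\|WA^{+}\|_F^{2} = \mathrm{svdb}(W)$ for this strategy.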

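Theorem 5.2. Given an $m \times n$ workload $W$,

$$\frac{P(\epsilon, \delta)\,\mathrm{svdb}(W)}{\mathrm{MinError}(W)} \ge \frac{1}{\Delta_W}\sqrt{\frac{\mathrm{svdb}(W)}{n}},$$

where $P(\epsilon, \delta) = 2\ln(2/\delta)/\epsilon^{2}$.

Proof. Let $W = Q_W\Lambda_WP_W$ be the singular value decomposition of $W$ and $A = \sqrt{\Lambda_W}P_W$. Noticing that
$\mathrm{MinError}(W) \le \mathrm{Error}_A(W)$ and

$$\frac{P(\epsilon, \delta)\,\mathrm{svdb}(W)}{\mathrm{Error}_A(W)} = \frac{1}{\Delta_A^{2}}\sqrt{\frac{\mathrm{svdb}(W)}{n}},$$

it is sufficient to prove that $\Delta_A^{2} \le \Delta_W$.

Let $P_W = (p_{ij})_{n \times n}$ and $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the diagonal entries of the matrix $\Lambda_W$. The $i$-th diagonal entries of
$A^{T}A$ and $W^{T}W$ are $\sum_{j=1}^{n}p_{ji}^{2}\lambda_j$ and $\sum_{j=1}^{n}p_{ji}^{2}\lambda_j^{2}$, respectively. Recalling that $\sum_{j=1}^{n}p_{ji}^{2} = 1$,

$$\Delta_A^{2} = \max_i \sum_{j=1}^{n}p_{ji}^{2}\lambda_j \le \max_i \sqrt{\sum_{j=1}^{n}p_{ji}^{2}\lambda_j^{2}} = \Delta_W.$$

□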



A.5 The SVD Bound and Operations on Workloads

Here we prove relations between the SVD bound and the operations on workloads. The proofs of union and
generalized negation are related to the relationship between singular values of matrices and their sum, as
stated in the proposition below.

Proposition A.1 ([10]). Given two $n \times n$ matrices $W_1$ and $W_2$ with singular values $\mu_1, \mu_2, \ldots, \mu_n$ and
$\lambda_1, \lambda_2, \ldots, \lambda_n$ respectively, let $\phi_1, \phi_2, \ldots, \phi_n$ be the singular values of $W_1 + W_2$. Then

$$\sum_{i=1}^{n}\phi_i \le \sum_{i=1}^{n}\mu_i + \sum_{i=1}^{n}\lambda_i.$$

For any $m \times n$ matrix $W$, there always exists an $n \times n$ matrix $W'$ such that the nonzero singular values
of $W$ and $W'$ are all the same. Proposition A.1 therefore holds even if both $W_1$ and $W_2$ are $m \times n$ matrices, which
leads to the relationship between the SVD bounds of two workloads and their sum.
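The inequality can be illustrated without an SVD routine in the $2 \times 2$ case, where the sum of the two singular values has a closed form: $(s_1 + s_2)^{2} = s_1^{2} + s_2^{2} + 2s_1s_2 = \|M\|_F^{2} + 2|\det M|$. The sketch below relies on that identity, which is a standard fact rather than something taken from the paper; the matrices are arbitrary examples and the function names are hypothetical.

```python
import math


def singular_value_sum_2x2(M):
    """Sum of the two singular values of a 2x2 matrix.

    Uses (s1 + s2)^2 = ||M||_F^2 + 2 * |det M|.
    """
    (a, b), (c, d) = M
    return math.sqrt(a * a + b * b + c * c + d * d + 2 * abs(a * d - b * c))


def add(M, N):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(M, N)]


W1 = [[1, 2], [3, 4]]
W2 = [[0, 1], [-1, 0]]
lhs = singular_value_sum_2x2(add(W1, W2))
rhs = singular_value_sum_2x2(W1) + singular_value_sum_2x2(W2)
print(lhs <= rhs)  # True
```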

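Theorem 6.1. Given an $m_1 \times n$ workload $W_1$ and an $m_2 \times n$ workload $W_2$ on the same set of $n$ cell
conditions,

$$\sqrt{\mathrm{svdb}(W_1)} + \sqrt{\mathrm{svdb}(W_2)} \ge \sqrt{\mathrm{svdb}(W_1 \cup W_2)}.$$

Proof. Let $W = W_1 \cup W_2$. Expand $W_1$, $W_2$ to two $(m_1 + m_2) \times n$ matrices as follows:

$$W_1' = \begin{bmatrix} W_1 \\ 0 \end{bmatrix}, \qquad W_2' = \begin{bmatrix} 0 \\ W_2 \end{bmatrix}.$$

Since $W_1'$ and $W_2'$ have the same singular values as $W_1$ and $W_2$, respectively, $\mathrm{svdb}(W_1') = \mathrm{svdb}(W_1)$ and
$\mathrm{svdb}(W_2') = \mathrm{svdb}(W_2)$. Furthermore, since

$$W_1' + W_2' = \begin{bmatrix} W_1 \\ W_2 \end{bmatrix} \supseteq W,$$

according to Prop. A.1, the sum of the singular values of $W_1'$ and $W_2'$ is larger than or equal to the sum of the
singular values of $W_1' + W_2'$. Therefore, with Thm. 4.3 and Prop. A.1,

$$\sqrt{\mathrm{svdb}(W_1)} + \sqrt{\mathrm{svdb}(W_2)} = \sqrt{\mathrm{svdb}(W_1')} + \sqrt{\mathrm{svdb}(W_2')} \ge \sqrt{\mathrm{svdb}(W)}.$$

□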
Theorem 6.2. For an $n$-dimensional query $v$ and an $m \times n$ workload $W$,

$$\sqrt{\mathrm{svdb}(v - W)} - \|v\|_2\sqrt{m} \le \sqrt{\mathrm{svdb}(W)}.$$

Proof. Let $W' = \mathbf{1}v^{T} - W$. Noticing the singular values of $\mathbf{1}v^{T}$ are $\|v\|_2\sqrt{m}, 0, \ldots, 0$, apply Prop. A.1 to
$W + W' = \mathbf{1}v^{T}$ and $-W + \mathbf{1}v^{T} = W'$, and we have the theorem proved. □

The property of the crossproduct can be proved by constructing a proper representation of the resulting
workload.

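Theorem 6.3. Given an $m_1 \times n_1$ workload $W_1$ and an $m_2 \times n_2$ workload $W_2$ defined on two distinct sets
of cell conditions: $\mathrm{svdb}(W_1 \times W_2) = \mathrm{svdb}(W_1)\,\mathrm{svdb}(W_2)$.

Proof. Let $w_1$ and $w_2$ be queries in $W_1$ and $W_2$, respectively, and $W = W_1 \times W_2$. Consider the vector
representation of $w_1$ and $w_2$: $w_1 = [w_{11}, w_{12}, \ldots, w_{1n_1}]^{T}$, $w_2 = [w_{21}, w_{22}, \ldots, w_{2n_2}]^{T}$. The crossproduct of
$w_1$ and $w_2$, denoted as $w$, can be represented as an $n_1$ by $n_2$ matrix whose $(i, j)$ entry is equal to $w_{1i}w_{2j}$.
In other words,

$$w = w_1w_2^{T}.$$

We can then represent $w$ as a vector, denoted as $w'$, which is a $1 \times n_1n_2$ vector that contains the entries in $w$
row by row. Therefore,

$$w' = [w_{11}w_2^{T},\ w_{12}w_2^{T},\ \ldots,\ w_{1n_1}w_2^{T}].$$

More generally, using $w^{(1)}_{ij}$ to denote the $(i, j)$ entry in $W_1$, $W$ can be represented as the following matrix:

$$W = \begin{bmatrix}
w^{(1)}_{11}W_2 & w^{(1)}_{12}W_2 & \cdots & w^{(1)}_{1n_1}W_2 \\
w^{(1)}_{21}W_2 & w^{(1)}_{22}W_2 & \cdots & w^{(1)}_{2n_1}W_2 \\
\vdots & \vdots & \ddots & \vdots \\
w^{(1)}_{m_11}W_2 & w^{(1)}_{m_12}W_2 & \cdots & w^{(1)}_{m_1n_1}W_2
\end{bmatrix}.$$

Let $v_1$, $v_2$ be eigenvectors of $W_1^{T}W_1$, $W_2^{T}W_2$ with eigenvalues $\lambda_1$, $\lambda_2$, respectively. Let the vector
representation of $v_1$ be $v_1 = [v_{11}, v_{12}, \ldots, v_{1n_1}]^{T}$. Consider the following vector:

$$v = [v_{11}v_2^{T},\ v_{12}v_2^{T},\ \ldots,\ v_{1n_1}v_2^{T}]^{T}.$$

According to block matrix multiplication,

$$W^{T}Wv = \begin{bmatrix}
\sum_{j=1}^{n_1}(W_1^{T}W_1)_{1j}\,v_{1j}\,W_2^{T}W_2v_2 \\
\sum_{j=1}^{n_1}(W_1^{T}W_1)_{2j}\,v_{1j}\,W_2^{T}W_2v_2 \\
\vdots \\
\sum_{j=1}^{n_1}(W_1^{T}W_1)_{n_1j}\,v_{1j}\,W_2^{T}W_2v_2
\end{bmatrix} = \begin{bmatrix}
\lambda_1v_{11}\lambda_2v_2 \\
\lambda_1v_{12}\lambda_2v_2 \\
\vdots \\
\lambda_1v_{1n_1}\lambda_2v_2
\end{bmatrix} = \lambda_1\lambda_2v.$$

Thus $v$ is an eigenvector of $W^{T}W$ with eigenvalue $\lambda_1\lambda_2$. Since $W_1^{T}W_1$ and $W_2^{T}W_2$ have $n_1$ and $n_2$
orthogonal eigenvectors, respectively, we can find $n_1n_2$ orthogonal eigenvectors with this method. Noticing
$W^{T}W$ only has $n_1n_2$ eigenvalues, the eigenvalues of those $n_1n_2$ eigenvectors are all the eigenvalues of $W^{T}W$.
Let $\lambda_{11}, \ldots, \lambda_{1n_1}$ be the eigenvalues of $W_1^{T}W_1$ and $\lambda_{21}, \ldots, \lambda_{2n_2}$ be the eigenvalues of $W_2^{T}W_2$. Then

$$\begin{aligned}
\mathrm{svdb}(W) &= \frac{1}{n_1n_2}\Big(\sum_{1 \le i \le n_1,\ 1 \le j \le n_2}\sqrt{\lambda_{1i}\lambda_{2j}}\Big)^{2} \\
&= \frac{1}{n_1}\Big(\sum_{i=1}^{n_1}\sqrt{\lambda_{1i}}\Big)^{2}\cdot\frac{1}{n_2}\Big(\sum_{j=1}^{n_2}\sqrt{\lambda_{2j}}\Big)^{2} \\
&= \mathrm{svdb}(W_1)\,\mathrm{svdb}(W_2).
\end{aligned}$$

Though we use a specific rule above to represent the query cross products as query vectors, according to
Theorem 4.2, the SVD bound is independent of the rule of representation. Thus we have the theorem proved
in arbitrary cases. □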
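The eigenvector construction in this proof can be replayed on a small example. The sketch below builds the block representation of $W_1 \times W_2$ (a Kronecker product under the row-by-row rule used above), checks that each combined vector is an eigenvector of $W^{T}W$ with eigenvalue $\lambda_1\lambda_2$, and compares the SVD bounds. The two workloads and all helper names are chosen for illustration only; the workloads were picked so that their eigenvectors have integer entries.

```python
import math


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    cols = transpose(B)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in A]


def matvec(M, v):
    return [sum(x * y for x, y in zip(row, v)) for row in M]


def kron(A, B):
    """Block matrix whose (i, j) block is A[i][j] * B."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def kron_vec(u, v):
    return [x * y for x in u for y in v]


def svdb(eigenvalues):
    """(1/n) * (sum of singular values)^2, from the eigenvalues of W^T W."""
    return sum(math.sqrt(l) for l in eigenvalues) ** 2 / len(eigenvalues)


W1 = [[1, 0], [0, 1], [1, 1]]        # W1^T W1 = [[2, 1], [1, 2]]
W2 = [[1, 1], [1, -1], [1, 1]]       # W2^T W2 = [[3, 1], [1, 3]]
eig1 = [([1, 1], 3), ([1, -1], 1)]   # (eigenvector, eigenvalue) pairs of W1^T W1
eig2 = [([1, 1], 4), ([1, -1], 2)]   # (eigenvector, eigenvalue) pairs of W2^T W2

W = kron(W1, W2)
G = matmul(transpose(W), W)          # W^T W

eigen_ok = all(
    matvec(G, kron_vec(v1, v2)) == [l1 * l2 * x for x in kron_vec(v1, v2)]
    for v1, l1 in eig1
    for v2, l2 in eig2
)

product = svdb([3, 1]) * svdb([4, 2])
joint = svdb([l1 * l2 for _, l1 in eig1 for _, l2 in eig2])
print(eigen_ok, abs(joint - product) < 1e-9)  # True True
```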